nmh-workers
[Top] [All Lists]

[Nmh-workers] A script for threading

2006-01-14 21:29:28
This script is both for use and for discussion regarding the final form
of a threading implementation. It is written (mostly) in awk, and it's
worth pointing out that a perl module already exists which implements the
algorithm I've used a part of, but I wouldn't have learned as much from
just using that module.

Firstly, this script might be usable, if not useful, as is. All it does
is format, in a very basic way, the threads from a "scan" listing, and is
called "spin". It certainly won't do any damage, as it only processes
the output of a call to "scan", so the worst that can happen is a coredump
or non-termination. Neither should happen though.

Arguments to spin go to scan, and should only be message numbers,
ranges, sequences, and optionally a folder name. Giving a switch for scan
might do weird things, since spin uses a special -format to do
its job, and a very bad/basic hack to -width for its output. It puts those
last, however, so that should override any -format or -width you give it,
anyway.

The location of awk, scan, and the scan format are hardcoded into
the script. You might want/need to change these. The script is also not
as efficient as it could be. I offer no apology for that at this stage. :)
The point of the exercise was to expose implementation issues, and do
a prototype/proof of concept kind of thing. The output routine, in
particular, doesn't need to be recursive, but that's a very small price
to pay for the ease with which the pretty printing could be beefed up
later.

The formatting is very basic, and one thing that might look a little
weird is when multiple messages are replies to a root message that
no longer exists. The output will be reordered to group those replies,
but for no obvious reason, since the script doesn't print dummy lines for
non-existent roots. Introducing that would be relatively easy.

It is also possible the script has bugs. :) I don't have a large amount
of email threads archived to test it on. Weird things will happen with
really nasty message-id, in-reply-to, and references headers, but what
I've written probably sanitizes those as well as most other software.

Now, onto the real discussion.

The reason I've done this is to familiarise myself with the issues regarding
a threading implementation and an integration of it. The script adapts the
part of the algorithm described at

http://www.jwz.org/doc/threading.html

which deals with references and in-reply-to. It does not group subjects.
This can be added later, but I wanted to keep it simple for the moment.

The most important adaptation of the algorithm, and the one that I
would recommend very strongly should threading ever be built into nmh,
is to preserve the existing order of messages in a folder when threading.
This provides a very natural separation between sorting and threading,
leaving the former to sortm. I have come to believe, very strongly, that
sortm should not know anything about threads. jwz briefly discusses
the distinction between sorting and threading too.

The fact that the script does its job by processing one pass from scan
is the key to the issue of whether the functionality should be built into
scan. Threading requires construction of not-insignificant data structures
after having read *all* the messages being threaded. From an implementation
point of view I believe this is substantially different to anything
scan does currently, and so might be a good reason not to put this in
scan, even though from an interface point of view it just looks like a
matter of pretty-printing scan's output. On the other hand, more elaborate
pretty printing might need some scan internals. I don't really know.

Thread navigation is a different matter. Part of the reason I'm hoping
people will use this script (at least those interested in threading) is so
they can get a feel for the kind of operations they'd like to have available.
It's not obvious what would fit naturally into nmh. It would be easy to
generate sequences according to some thread-based criteria, and so perhaps
threading needs to be incorporated into pick, but like scan I don't think
pick does anything so complex at the moment. It would be easier for those
sequences not to be threaded, though, because they can be passed to spin
or a thread-grokking pick for further sub-thread work. That wouldn't be
so expensive.

One operation that is perhaps immediately desirable is to thread "sort"
(I know, it sounds like I'm contradicting myself) the folder so that
next agrees with the output of spin. This can be done with spin and
refile. sortm the folder first, to get the sibling orders the way you
want, and then spin it, outputting only the message numbers. This can be
done by altering the SCANLINE_FORMAT of the script or by picking out the
message number field of the normal spin output (whatever that might be).
Refiling each of those numbers as they come out will produce a folder in
the spun order.  Refiling to the *same* folder will even do that.

Anyway, that's all off the top of my head. I hope those who are interested
in this issue find the time to play with the script and consider what
I've written above.

Cheers,

        - Joel
#!/bin/sh

AWK=/usr/bin/awk

SCANLINE_FORMAT=`/bin/cat /etc/nmh/scan.default`

TERMINAL_WIDTH=`/bin/echo "$TERMCAP" | $AWK 'BEGIN { RS=":" } /^co#/ { print 
substr($0, 4, length($0)); flag=1 } END { if(!flag) print 80 }'`

/usr/local/bin/scan $* -width 4096 -format "<fake-root-id> %{references} 
%{in-reply-to} %{message-id}\n$SCANLINE_FORMAT\n" | $AWK '

function pruneshow(x, prefix, i) {
        if(x in scanline) {
                print prefix substr(scanline[x], 1, width-1-length(prefix))
                prefix=prefix "   "
        }
        for(i=0; i<lastkid[x]; i++)
                if((x, i) in kids)
                        pruneshow(kids[x, i], prefix)
}

BEGIN {
        scanline["<fake-root-id>"]="THIS SHOULD NOT APPEAR"
}

{
        i=0
        for(j=1; j<=NF; j++)
                if("<"==substr($j, 1, 1) && ">"==substr($j, length($j), 
length($j)))
                        $++i=$j
        NF=i
        midsuffix=""
        while(($NF midsuffix) in scanline)
                midsuffix++
        getline scanline[$NF=$NF midsuffix]
        if($NF in parent) {
                delete kids[parent[$NF], kidnumber[$NF]]
                delete parent[$NF]
        }
        for(ancestor=$(i=1); i<NF; ancestor=$++i)
                if(!($(i+1) in parent))
                        while(ancestor!=$(i+1))
                                if(ancestor in parent)
                                        ancestor=parent[ancestor]
                                else {
                                        kids[parent[$(i+1)]=$i, 
kidnumber[$(i+1)]=lastkid[$i]++]=$(i+1)
                                        break
                                }
}

END {
        for(i=0; i<lastkid["<fake-root-id>"]; i++)
                if(("<fake-root-id>", i) in kids)
                        pruneshow(kids["<fake-root-id>", i])
}

' width=$TERMINAL_WIDTH
_______________________________________________
Nmh-workers mailing list
Nmh-workers@nongnu.org
http://lists.nongnu.org/mailman/listinfo/nmh-workers
<Prev in Thread] Current Thread [Next in Thread>