On Fri, 01 Jun 2001 17:43:47 -0700
Earl Hood <ehood(_at_)hydra(_dot_)acs(_dot_)uci(_dot_)edu> wrote:
On May 31, 2001 at 20:54, J C Lawrence wrote:
Good point. As the ultimate goal is to shove the entire message
base into an SQL DB (I've got users begging for things like
thread-bounded searches and the ability to gen meta views of an
archive), I'll probably head that way.
While it's a gruesome hack, I'm ultimately looking to use MHonArc
as a front end processor which writes scripts as output which are
then executed to insert the message and all its particulars
into an SQL DB. What I haven't figured out yet is how to
properly extract the thread linkings for input into the DB, as
well as how to effectively (ie scalably) provide the thread
database to MHonArc when archiving a message (we're talking
hundreds of thousands of messages, possibly small order
It's on my TODO list to allow callback hooks during MHonArc
processing. The problem is that to allow a decent callback API,
some of the internal functions need changing. Something for
probably a 2.5 release (whenever that is).
With a hook, you can store the message-ids and
references/in-reply-to data in a DB, and then compute the threads
from that. This is basically what MHonArc does.
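As a rough sketch of that idea (not MHonArc code): assume a hypothetical table msg holding each message's Message-ID, In-Reply-To and References values, and pick as parent the nearest referenced message that is actually present in the DB. The table layout, column names and the build_parent_map helper are all invented for illustration.

```python
import sqlite3

def build_parent_map(rows):
    """Map each message-id to its parent message-id, or None for a thread root.

    rows: iterable of (message_id, in_reply_to, references), where references
    is a space-separated string of message-ids, oldest first.
    """
    rows = list(rows)
    known = {mid for mid, _, _ in rows}
    parents = {}
    for mid, in_reply_to, references in rows:
        # Candidates, nearest ancestor first: In-Reply-To, then References
        # scanned from the most recent entry backwards.
        candidates = []
        if in_reply_to:
            candidates.append(in_reply_to)
        candidates.extend(reversed((references or "").split()))
        # Take the first candidate that is actually archived.
        parents[mid] = next(
            (c for c in candidates if c in known and c != mid), None
        )
    return parents

# Hypothetical schema and sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE msg (message_id TEXT PRIMARY KEY, in_reply_to TEXT, refs TEXT)"
)
conn.executemany("INSERT INTO msg VALUES (?, ?, ?)", [
    ("<a@x>", None, ""),
    ("<b@x>", "<a@x>", "<a@x>"),
    ("<c@x>", None, "<a@x> <b@x>"),
    ("<d@x>", "<gone@x>", "<a@x> <gone@x>"),  # direct parent not archived
])
parents = build_parent_map(
    conn.execute("SELECT message_id, in_reply_to, refs FROM msg")
)
```

Falling back through References means a reply whose direct parent never reached the archive still lands in the right thread, one level higher than it strictly belongs.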
How does MHonArc currently attempt to thread messages which are
missing In-Reply-To/References headers, but which share date and
subject strings with an extant thread?
At that point my main interests in MHonArc are its excellent MIME
and charset handling (damned fine job BTW). I'd like to also use
it to build the thread graph rather than building it dynamically
off the References/In-Reply-To headers, as MHonArc
properly handles the matching-subject thread hits.
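For what a matching-subject fallback might look like in general, here is a minimal sketch. It is NOT MHonArc's actual algorithm, which this thread does not spell out: it just strips reply prefixes, and attaches any message with no header-derived parent to the earliest message carrying the same normalized subject. The function names and the prefix list are assumptions.

```python
import re

# Leading reply/forward markers to strip; this prefix list is an assumption.
_PREFIX = re.compile(r"^\s*(?:(?:re|fwd?|aw)\s*:\s*)+", re.IGNORECASE)

def normalize_subject(subject):
    """Drop reply prefixes, collapse whitespace, and lowercase."""
    return " ".join(_PREFIX.sub("", subject).split()).lower()

def attach_by_subject(messages, parents):
    """Fill in missing parents using subject matches.

    messages: list of (message_id, subject) in date order.
    parents:  dict of message_id -> parent id (or None) from header threading.
    Returns a new dict; header-derived parents are never overridden.
    """
    first_seen = {}  # normalized subject -> earliest message-id
    result = dict(parents)
    for mid, subject in messages:
        key = normalize_subject(subject)
        if key and result.get(mid) is None and key in first_seen:
            result[mid] = first_seen[key]
        if key:
            first_seen.setdefault(key, mid)
    return result

messages = [
    ("<a@x>", "Thread DB"),
    ("<b@x>", "Re: Thread DB"),
    ("<c@x>", "RE: thread  db"),  # no In-Reply-To/References
    ("<d@x>", "Other topic"),
]
header_parents = {"<a@x>": None, "<b@x>": "<a@x>", "<c@x>": None, "<d@x>": None}
threaded = attach_by_subject(messages, header_parents)
```

A real implementation would presumably also bound matches by date, as the question above suggests, so that an old subject reused months later does not get glued onto a dead thread.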
With the current code base, you can access the thread listing
There are multiple approaches, but one is creating a custom
mhonarc that does a dump of thread data after an archive update in
some format you need. Two main variables are created when
generating the thread data: @TListOrder and %Index2TLoc. The
first is a list of message indexes in the order to be rendered on
a thread index page. The second is a hash that maps a message
index to the ordinal thread index position (useful in resource
Also generated is the %ThreadLevel hash. This maps a message
index to the thread depth of the message. A depth of 0 means it
is a root-level message. Therefore, with @TListOrder and
%ThreadLevel one can infer the thread tree structure.
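To illustrate the inference, here is a small sketch that recovers each message's parent from a render-order list plus a depth map. The Python names t_list_order and thread_level merely stand in for @TListOrder and %ThreadLevel, and the message indexes are made up; the real MHonArc index keys look different.

```python
def parents_from_order(t_list_order, thread_level):
    """Recover parent links from a depth-annotated, pre-order thread listing.

    t_list_order: message indexes in thread-page render order.
    thread_level: dict of message index -> depth (0 = root-level message).
    Returns a dict of message index -> parent index (None for roots).
    """
    parents = {}
    stack = []  # (depth, index) pairs along the current root-to-leaf path
    for idx in t_list_order:
        depth = thread_level[idx]
        # Unwind to the nearest earlier message that is strictly shallower.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parents[idx] = stack[-1][1] if stack else None
        stack.append((depth, idx))
    return parents

# Hypothetical data: m1 -> (m2 -> m3), m4; m5 starts a new thread.
order = ["m1", "m2", "m3", "m4", "m5"]
level = {"m1": 0, "m2": 1, "m3": 2, "m4": 1, "m5": 0}
tree = parents_from_order(order, level)
```

This is a single linear pass, so dumping the two structures after an archive update and loading the resulting parent links into an external DB should stay cheap even for large archives.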
Yup. My problem is that both of these data sets suffer badly when
they get large (eg ~500K - 1Million messages). Methinks I'll have
to shove all the thread data into the external DB and then attempt
to have MHonArc either build deltas against it (eg the dump you
mentioned above), or do the heavy lifting in determining
insertion points for unthreaded messages which look like thread
members, and do everything else at HTML page generation time.
These structures are a sequential way of representing message
threads, but they are conducive to generating the HTML thread index
pages since that is done in a sequential manner. Also, in Perl 4
days, doing complex tree structures was a non-trivial task.
I believe. I've been dumbstruck to find that PHP does not have an
ordered collection type which allows insertions (eg a list, vector,
etc). They've got associative arrays and objects as their derived
types, and then a fairly simple set of base types.
Building one is going to be un-fun. Best I can think of so far is
doing evil work with associative array key generation, but that
*really* is evil.
BTW, the following is a snippet from mhinit.pl:
Unfortunately, my memory needs refreshing on all the threading
stuff, so I'm probably forgetting something. The multi-page index
support does complicate some of the stuff (hence the
Yeah, I'll have to dig into this. Thanks.
J C Lawrence claw(_at_)kanga(_dot_)nu
The pressure to survive and rhetoric may make strange bedfellows