Threading algorithm

2001-08-22 15:34:12
Forgive me if this is a topic that has been discussed before; I tooled
around on the FAQ and didn't see it, and searches on the web archives all
keyed in on the word "thread", which turned out to be useless (because the
word "Thread" is on every page).


We've been using mhonarc for our mailing lists for several years, and
we've been pretty happy with it.  We've kept up with updates, etc.; we're
running mhonarc 2.4.9.

One of our mailing list archives is fairly large; it spans several years,
and currently has 32 index pages.

On this list, people tend to reuse the same subject lines over and over, even
though they are in completely different threads (it's a tech help kind of
list).  So subjects like "lamboot problem" or "problem with mpirun" are
common, even though they are unrelated messages.

According to the Mhonarc documentation,
"Threads are based upon In-Reply-To and References fields of messages and
by same Subjects."  This can be a problem when people use the same subject
lines in unrelated messages.

What happens is that a totally new message (i.e., the first message in a
new thread) that just happens to have the same subject as some previous
message will get threaded under the old thread.  This could be a message
that was sent long ago, such that the new message will get buried in a
high-numbered thread index page, making it difficult (if not impossible)
to find.

Of course, the new message shows up properly in the date index.

Is there any way to disable or tune the subject-matching aspects of the
threading algorithm?  I realize that disabling subject-matching may
mis-thread some messages that *should* be threaded (e.g., from icky mail
clients that don't include In-Reply-To or References lines), but I think
that I would prefer such messages show up as a top-level post rather than
have unrelated messages show up buried in a very old thread index.

Or perhaps it would be possible to do threading solely based on subject
matching (in the absence of In-Reply-To or References) only within some
maximum amount of time -- e.g., in the absence of IRT or Ref lines, only
thread a new message to an old message if:

- the subject matches
- the date difference between the new and old message is < N days (perhaps
  N can be user-definable)
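
To make the idea concrete, here is a rough Python sketch of the rule above.
It is not how mhonarc actually works internally; the message layout (dicts
with "id", "subject", "date", "in_reply_to", "references") and the names
normalize(), find_parent() and max_days are all made up for illustration.
Attaching to the most recent matching message is also just one possible
choice.

```python
import re
from datetime import datetime, timedelta


def normalize(subject):
    """Strip leading Re:/Fwd: prefixes, lowercase, and collapse whitespace."""
    s = subject.strip()
    while True:
        stripped = re.sub(r"^(re|fwd?)\s*:\s*", "", s, flags=re.IGNORECASE)
        if stripped == s:
            break
        s = stripped
    return " ".join(s.lower().split())


def find_parent(msg, earlier, max_days=30):
    """Return the id of the message to thread msg under, or None.

    Hypothetical rule: explicit In-Reply-To/References always win;
    subject matching is used only as a fallback, and only when the
    older message is less than max_days older than msg.
    """
    known_ids = {m["id"] for m in earlier}

    # 1. Explicit headers: In-Reply-To first, then References (newest first).
    refs = [msg.get("in_reply_to")] + list(reversed(msg.get("references", [])))
    for ref in refs:
        if ref in known_ids:
            return ref

    # 2. Subject fallback, limited to the time window.
    window = timedelta(days=max_days)
    wanted = normalize(msg["subject"])
    candidates = [
        m for m in earlier
        if normalize(m["subject"]) == wanted
        and timedelta(0) <= msg["date"] - m["date"] < window
    ]
    if candidates:
        # Assumption: attach to the most recent matching message.
        return max(candidates, key=lambda m: m["date"])["id"]

    # 3. Otherwise the message becomes a new top-level thread.
    return None
```

With something like this, a new "lamboot problem" message sent a year after
an old one with the same subject would start its own thread, while a
header-less "Re: lamboot problem" sent the next day would still be attached
to the new one.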

These are just ideas off the top of my head.  Any help would be
appreciated.  Thanks.

{+} Jeff Squyres
{+} squyres(_at_)cse(_dot_)nd(_dot_)edu
{+} Perpetual Obsessive Notre Dame Student Craving Utter Madness
{+} "I came to ND for 4 years and ended up staying for a decade"

