Re: persistently linking to within archives

1999-11-23 16:37:44
Okay.  A few choice documents later, I'm on a much firmer footing.  I
want to belatedly clarify where I'm coming from.

My core assumption is this: I'm implementing a system designed to store
and retrieve mailing list metadata, for use with current versions of
MHonArc, specifically targeting Unix platforms.  I'm not going out of my
way to break Windows, but I don't have any Wintel systems handy and I
don't know Windows well.

I'm using Perl, because of MHonArc's lead, because of platform
neutrality, and because Perl makes me happy.

My primary contribution is a simple, relational data model for storing
multi-archive metadata.  (I happily realized a couple days ago that I
was simultaneously working on the subject & authors project I mentioned
on 10/24.)  Whether the data model is implemented with a Perl
DBI-compliant DBMS, or with DB_File, should not be surpassingly

Secondary contributions, such as tools used to provide data to the
database, and CGIs used to query the database, are expected to be
readily portable between implementations of the data model, and may be
partially/mostly applicable to other systems similar to MHonArc, such as
hypermail.  However, this is just a happy accident.

I'm doing this primarily for myself, because my archives need this
functionality.  I intend to have it fully functional by the first of the
year, pending the survival of civilization.  I recognize the astonishing
diversity of MHonArc installations (I had a good giggle when Mr.
Breidenbach asked if MH could handle more than 100,000 messages in an
archive, considering as I do a 2,500 message archive to be on the hefty
side), and I don't know how universal my solution is going to be,
particularly in the early stages.

The SDSC SRB is many times thicker than anything I had envisioned --
it's a project informed by their staggeringly large data sets (I don't
recall ever having seen petabytes mentioned unironically before). 
However, it is sort of a parallel project, and I'm having a good time
poring over the documentation and identifying things to borrow, both
short and long term.  I am grateful for the reference.

On 11/21/99 at 9:54 PM, asgilman(_at_)iamdigex(_dot_)net (Al Gilman) wrote:

> If you are going to mix Message-ID's with MD5 hashes in the same
> syntax, how would you be assured of non-conflicting values?

RFC 822 doesn't specify any means of defining the unique part of a
Message-ID; it just recommends that they not repeat within at least two
years at a single site.  Between a 128-bit hash and a specific,
fictional hostname, even if the MD5 mechanism is used uniformly across a
colossal site, the odds of a collision should still be lower than
hitting the lottery.  No?
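
As a rough illustration of the idea (a sketch, not the actual
implementation), a stand-in Message-ID could be built from the MD5 of the
raw message plus a placeholder host.  The host name md5.archive.invalid
and both function names are assumptions made up for this example.  The
second function is the usual birthday-bound estimate for n random values
of a given bit width.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical placeholder host; any name reserved for this purpose would do.
my $FAKE_HOST = 'md5.archive.invalid';

# Build a stand-in Message-ID from the raw message text.
sub synthetic_message_id {
    my ($raw_message) = @_;
    return md5_hex($raw_message) . '@' . $FAKE_HOST;
}

# Birthday-bound approximation: p ~ n^2 / 2^(bits+1) for n random values.
sub collision_probability {
    my ($n, $bits) = @_;
    return ($n ** 2) / (2 ** ($bits + 1));
}

my $id = synthetic_message_id("From: someone\n\nHello, list.\n");
print "$id\n";

# Even a billion hashed messages leave the estimate around 1.5e-21.
printf "%.3g\n", collision_probability(1e9, 128);
```

Because the host part never occurs in real mail, a hashed ID can only
collide with another hashed ID, which is the case the estimate covers.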

>> I have run across one case where the message-ID is not sufficient to
>> determine the message's uniqueness -- when we start talking about
>> spanning multiple MHonArc archives, we run across the possibility
>> that the same message will be legitimately sent to multiple lists.

> Yes, then you are looking to find not a message but a location in the
> thread structure.  In the above multi-parameter keying scheme you
> could simply add archive=thread to the &mid=foo(_at_)bar to make the URL
> return what you want.

Well, the user is still looking for a message, if only by virtue of its
location in the thread structure.  Once it's found, they will presumably
be comfortable using MHonArc's own tools to navigate the thread.  The
problem is that having submitted only a message-ID, they haven't yet
provided enough information for the system I outlined previously (wholly
dependent on the message-ID as a unique, identifying key) to send them
where they want to go.
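
The ambiguity can be sketched in a few lines, assuming a hypothetical
in-memory index (the archive names, file names, and the locate function
are invented for illustration, not taken from any real installation).  A
lookup by Message-ID alone can return several locations; supplying an
archive name narrows it to one.

```perl
use strict;
use warnings;

# Hypothetical index: Message-ID => list of [archive name, message file].
my %by_mid = (
    'foo@bar' => [ [ 'list-a', 'msg00012.html' ], [ 'list-b', 'msg00340.html' ] ],
    'baz@bar' => [ [ 'list-a', 'msg00013.html' ] ],
);

# Return every location matching the Message-ID, narrowed by archive if given.
sub locate {
    my ($mid, $archive) = @_;
    my @hits = @{ $by_mid{$mid} || [] };
    @hits = grep { $_->[0] eq $archive } @hits if defined $archive;
    return map { "$_->[0]/$_->[1]" } @hits;
}

print join(', ', locate('foo@bar')), "\n";            # two candidates
print join(', ', locate('foo@bar', 'list-b')), "\n";  # exactly one
```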

It seems clear, for this and other reasons, that sites will have to
establish and maintain list namespaces if they want to support what
would otherwise be tagged as redundant messages, and reduced to a single

> It's not much of a factor in MHonArc's multi-archive indexing because
> the latter doesn't exist yet.

Well, it exists on my system, after a fashion... it's not very smart
yet, but it's coming along.

> In other words, the highest and best solution to your problem is a
> pragma applicable to mid: URLs.

Having now gone over RFC 2392, I am very curious as to whether a mid:
URL has ever been observed in the wild.

They seem to observe the same server-independence as mailto or news
URLs, requiring equivalent investment in client-side infrastructure to
be made useful.  I suppose that what I'm proposing is essentially a
message-ID-centric middleware server, but I have no ambitions to either
supersede HTTP or start building client software.

Should mid-speaking client software evolve, either standing alone or as
functionality of an MUA, it should be easy to adapt this data model and
associated bits of middleware (mid-middleware?) to interoperate with it.
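
For what it's worth, turning a mid: URL into a lookup key looks cheap.
The sketch below assumes the RFC 2392 form mid:message-id[/content-id]
with %XX escapes; the function names are hypothetical, and it does no
validation of the address syntax beyond splitting on the first slash.

```perl
use strict;
use warnings;

# Undo %XX escapes as used in URLs.
sub unescape {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Split "mid:message-id[/content-id]" into its parts.
# Returns an empty list if the string is not a mid: URL.
sub parse_mid_url {
    my ($url) = @_;
    return unless $url =~ m{^mid:([^/]+)(?:/(.+))?$}i;
    my ($mid, $cid) = ($1, $2);
    return (unescape($mid), defined $cid ? unescape($cid) : undef);
}

my ($mid, $cid) = parse_mid_url('mid:foo%40example.org');
print "$mid\n";
```

The returned Message-ID could then be handed to whatever lookup the
middleware already performs for HTTP queries.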