Re: reproducible URLs

On September 3, 1998 at 13:10, Claire McNab wrote:

For this reason I prefer MD5 sums of the message body -- there is
statistically only a 1:18446744073709551616 chance of matching a
false positive.  For anyone interested, I've written some sendmail
8.9.1 patches to add md5sums at the sendmail level (based on the work of
Martin Hamilton) and also have a procmail recipe to do the same.
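
As a rough illustration of the idea (not the sendmail patches or the procmail recipe mentioned above), a body checksum could be computed as in the sketch below. The function name body_md5 is hypothetical, and treating everything after the first blank line as the body is an assumption of this sketch.

```python
import hashlib

def body_md5(raw_message: bytes) -> str:
    """Return the hex MD5 digest of a message body (hypothetical helper)."""
    # Normalize line endings so CRLF and LF copies of a message hash alike.
    normalized = raw_message.replace(b"\r\n", b"\n")
    # Headers end at the first blank line; the rest is treated as the body.
    _headers, _sep, body = normalized.partition(b"\n\n")
    return hashlib.md5(body).hexdigest()

if __name__ == "__main__":
    sample = b"From: someone@example.org\nSubject: test\n\nhello\n"
    print(body_md5(sample))  # 32 hex characters (a 128-bit digest)
```

Because only the body is hashed, two copies of a message that differ in headers alone (for example, extra Received: lines) yield the same sum.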


This sounds like a useful issue to tackle.  I've been hit by it a few 
times, when from other parts of my site, or in other messages to 
the list, I have referred to articles by filename ... only to find 
that later, when I have rebuilt the archives to roll out new .rc 
files, the filename has changed :(

However, I wonder about the MD5 method.  Without knowing anything 
about MD5, could it work with 8.3 filenames?  I build my archives on a 
DOS box, so am constrained to that format.


Such a method will not work under 8.3 filenames.  The current method
is friendly to 8.3 systems.
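
A small sketch of why a full MD5 sum cannot serve as an 8.3 filename: the hex form is 32 characters, while the base name may hold only 8. The helper fits_8_3 and the sample name msg00042.htm (standing in for the current sequential style) are assumptions for illustration.

```python
import hashlib

def fits_8_3(filename: str) -> bool:
    """True if the name has a 1-8 char base and an extension of at most 3 chars."""
    base, _dot, ext = filename.partition(".")
    return 1 <= len(base) <= 8 and len(ext) <= 3 and "." not in ext

digest = hashlib.md5(b"example message body\n").hexdigest()
print(len(digest))                # 32
print(fits_8_3(digest + ".htm"))  # False: the base name is far too long
print(fits_8_3("msg00042.htm"))   # True: a short sequential name fits
```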

I am also not concerned about the lack of message-IDs: this problem 
becomes visible quite quickly in my setup, as articles are repeatedly 
added to the database on each archive build.  When I spot this, I 
just edit the mbox file and add a message-id of the form
poster's_name_YYMMDDHHMMSS_something_random(_at_)no-valid-msg-id
(I know this is a prob for others, and I recognise the difficulty -- 
I'm just saying it's not a prob for me, though I hope it would be 
supported for the benefit of others, esp those with more heavily 
automated systems).
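
The hand-made ids described above could be generated along these lines. This is only a sketch: the function name is hypothetical, "(_at_)" is assumed to stand for "@", and eight random hex digits are an arbitrary choice for the "something_random" part.

```python
import secrets
from datetime import datetime

def make_placeholder_id(poster: str, when: datetime) -> str:
    """Build an id of the form name_YYMMDDHHMMSS_random@no-valid-msg-id."""
    name = "_".join(poster.split())        # "Claire McNab" -> "Claire_McNab"
    stamp = when.strftime("%y%m%d%H%M%S")  # two-digit year, as in the form above
    rand = secrets.token_hex(4)            # 8 random hex digits
    return f"{name}_{stamp}_{rand}@no-valid-msg-id"

if __name__ == "__main__":
    print(make_placeholder_id("Claire McNab", datetime(1998, 9, 3, 13, 10, 0)))
```

The id must be written back into the mbox file, as described above, so that it stays the same on every rebuild; generating a fresh random part each time would defeat the purpose.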


v2.3 will create a message-id for messages w/o one.  The id has
the string "NO-ID-FOUND" in it so one can tell the id was generated
by MHonArc.

So it occurred to me that one way of implementing this would be to 
create a new .db file (e.g. filename.db), which would record the 
filenames used for each message ID and for each MD5 sum.  That way 
the chances of a duplicate occurring are *very* low: it would require 
a duplicate MD5 sum *and* a duplicate or missing msg-id.

AFAICS, mhonarc.db is wiped when the archives are rebuilt ... and all 
that would be needed is to ensure that filename.db is not wiped on a 
rebuild, and its data reused.   That way, we could retain the current 
flexibility of filename format (which has other advantages, such as 
being reasonably transparent) and add permanency.
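
One way to picture the proposed filename.db is sketched below. Nothing here reflects how MHonArc stores its data: the JSON file format, the function names, and the msgNNNNN.htm naming are all assumptions made for illustration. Each message is keyed on its message-id together with its body MD5 sum, so two messages share a filename only if both values match (or the id is missing and the sums match).

```python
import json
import os

def load_map(path):
    """Load the persistent filename map, or start an empty one."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    return {"names": {}, "next": 0}

def filename_for(db, msg_id, md5_hex):
    """Return the filename recorded for a message, assigning one if new."""
    key = f"{msg_id or ''}\t{md5_hex}"      # an empty id falls back to MD5 alone
    name = db["names"].get(key)
    if name is None:
        name = "msg%05d.htm" % db["next"]   # 8.3-friendly sequential name
        db["next"] += 1
        db["names"][key] = name
    return name

def save_map(db, path):
    """Write the map back so the next rebuild reuses the same names."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(db, fh, indent=1)
```

A rebuild would call load_map first, look up each message with filename_for, and finish with save_map. As long as the map file survives the rebuild, a message keeps its filename even if the order of messages changes.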

How does that sound?


Changing the v2.x code base to support different filenames from the
current convention will take some work.  Also, if such a feature
were to be added to v2.x, the current filename style should still be
supported.  I.e., alternate schemes would be triggered by a resource.

Using message-ids (or MD5 sums) is something I will look into
for v2.x, but after v2.3 is released.

        --ewh

----
             Earl Hood              | University of California: Irvine
      ehood(_at_)medusa(_dot_)acs(_dot_)uci(_dot_)edu      |      Electronic Loiterer
http://www.oac.uci.edu/indiv/ehood/ | Dabbler of SGML/WWW/Perl/MIME