Re: reproducible URLs

I don't know if this would be a workable solution, but I wrote a little
Perl script that uses "base 36" (a-z, 0-9) to convert an hh:mm:ss yy:mm:dd
timestamp into 7 characters that could be stuffed into a DOS filename.  The
eighth character was used for differentiation if a file already existed
with the same initial 7 characters -- i.e., if two files had the same
7-character timestamp, the second one had an "a" appended.  Then if there
was another dupe, the third would have "b" appended, etc.  Thus you
could have up to 37 files with the same timestamp in the same directory.
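
For anyone who wants to play with the idea, here is a rough sketch in
Python rather than Perl. It is not the script described above: the
digit order (a-z, then 0-9), the choice of counting seconds from
1 January 1900, and the names encode_timestamp and unique_name are all
assumptions made up for illustration.

```python
from datetime import datetime

# Assumed digit order: a-z first, then 0-9 (the real script may differ).
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"


def encode_timestamp(ts, epoch_year=1900):
    """Pack whole seconds since 1 Jan of epoch_year into 7 base-36 chars."""
    seconds = int((ts - datetime(epoch_year, 1, 1)).total_seconds())
    if not 0 <= seconds < 36 ** 7:
        raise ValueError("timestamp does not fit in 7 base-36 characters")
    chars = []
    for _ in range(7):
        seconds, digit = divmod(seconds, 36)
        chars.append(ALPHABET[digit])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(chars))


def unique_name(stem, existing):
    """Return the bare stem, or the stem plus the first free suffix."""
    if stem not in existing:
        return stem
    for suffix in ALPHABET:
        candidate = stem + suffix
        if candidate not in existing:
            return candidate
    raise RuntimeError("all 37 names for this timestamp are taken")


if __name__ == "__main__":
    taken = set()
    stem = encode_timestamp(datetime(1998, 8, 1, 12, 0, 0))
    for _ in range(3):
        name = unique_name(stem, taken)
        taken.add(name)
        print(name + ".htm")
```

The bare 7-character name plus 36 possible suffix characters gives the
37 names per timestamp mentioned above.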

If it's of any value, I'd be happy to make the code available (it's ugly,
but it works).

John Ackermann
jra(_at_)febo(_dot_)com
http://www.febo.com
----
In message
<000101bddcb8$80d0c260$ba8bec84(_at_)romano-nyswri(_dot_)cfe(_dot_)cornell(_dot_)edu>,
"Steve Pacenka" writes:

Your message was fun, Jeff.

I reasoned this out similarly, thinking along the lines of base-64 used in
MIME.  The permissible character set for DOS-platform file names contains at
least 46 characters.  The number of different names expressible in 8 base-46
characters is sufficient to have a minuscule collision probability for
archives of any reasonable size.  A 100,000 message archive seems two orders
of magnitude too high for MHonArc's basic design; anything that large using
a filesystem as its database needs to be organized hierarchically.  That
would add a subdirectory namespace into the quota.
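
To put rough numbers on the size of such a namespace, here is a small
Python sketch. The helper name name_bits is made up for illustration,
and the subdirectory length of 8 characters is just an assumed example,
not something MHonArc does.

```python
import math


def name_bits(alphabet_size, length):
    """Bits of information in a name of `length` symbols."""
    return length * math.log2(alphabet_size)


if __name__ == "__main__":
    # 8 characters from a 46-symbol alphabet: roughly 44 bits.
    print(round(name_bits(46, 8), 1))
    # 8 characters from a 36-symbol alphabet: roughly 41 bits.
    print(round(name_bits(36, 8), 1))
    # An 8-character subdirectory name (assumed) doubles the bit count.
    print(round(name_bits(46, 8 + 8), 1))
```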

-- SP

Anyway, sorry I didn't jump in then, but the kind-of-fun question was
implicitly raised: how many bits of randomness do you need for
reproducible URLs in MHonArc?  (Hey, it's not every day that real life
questions can be tackled like problem sets!)

Now if we are restricted to ending the filenames with something like
.htm, then there are only about 41 bits of randomness, and then we
run about 1% risk of collision for a puny n=100,000 message archive.
That's pushing it.
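
The risk can be checked with the usual birthday approximation. The
sketch below assumes filenames are drawn uniformly and independently at
random, which real message-ID hashes only approximate; the function
name collision_probability is made up for illustration. Under those
assumptions, 41 bits and n=100,000 come out at a few tenths of a
percent, so treat the 1% above as a round, conservative figure.

```python
import math


def collision_probability(n, bits):
    """Approximate chance that any two of n random `bits`-bit names match.

    Uses the birthday approximation p = 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    pairs = n * (n - 1) / 2
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-pairs / 2 ** bits)


if __name__ == "__main__":
    for bits in (41, 57, 128):
        print(bits, collision_probability(100_000, bits))
```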

Ok, one last note. If we use a real filesystem, with upper and lower
case letters in the filenames, we'd still need 10 characters in the
filename to meet/exceed the acceptable safety margin (57 bits). So
those lower case letters don't help us much in the region we are
interested in.

Using MD-5 checksums for filenames is complete overkill, statistically
speaking. They are 128 bits, and would consume 20-odd characters in
the filename. 10-character filenames would do the trick nicely. There
is certainly no need to combine MD-5 and message-IDs from a
statistical standpoint.
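
The character counts can be worked out directly. This is only a
sketch: chars_needed is a made-up helper, and the alphabet sizes (36
for one case plus digits, 62 for both cases plus digits) are
assumptions about which characters a filesystem would allow.

```python
def chars_needed(bits, alphabet_size):
    """Smallest name length whose namespace covers 2**bits values."""
    target = 1 << bits
    length, capacity = 0, 1
    # Integer arithmetic avoids floating-point rounding at the boundary.
    while capacity < target:
        capacity *= alphabet_size
        length += 1
    return length


if __name__ == "__main__":
    # 57 bits with upper case, lower case, and digits.
    print(chars_needed(57, 62))
    # A full 128-bit checksum in the same alphabet, and in base 36.
    print(chars_needed(128, 62))
    print(chars_needed(128, 36))
```

With 62 symbols, 57 bits fit in 10 characters, while a full 128-bit
checksum needs 22 (or 25 in base 36).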