mhonarc-users

reproducible URLs

1998-09-10 01:44:14

A while back there was a discussion of reprodicible URLs (to avoid
messing up search engines when mail gets re-archived) and issues
surrounding randomness, probabilities, MD-5, message-ID, and 8.3
filenames.

Anyway, sorry I didn't jump in then, but the kind-of-fun question was
implicitly raised: how many bits of randomness do you need for
reproducible URLs in MHonArc?  (Hey, it's not every day that real life
questions can be tackled like problem sets!)

We know that the more messages there are, the more likely
that there will be a duplicated filename.  So, lets assume file names
have x bits of randomness and there are n messages. The probability of
collision is

    n
    --                                n
    \                                /
    /   i       which is            |   i di          2
    --          approximately       /                n
   i=0                              0          =  ------
 -----------                     ------------       x+1
       x                             x             2
     2                              2

which is the total likelihood of collisions over the total
number of possibilities (often called the sample space.)

So we want a low chance of collision for any likely size of n.  If
n=10^6 (about 2^20) and we want a one-in-a-hundred-thousand odds of
collision for such and extreme case, then x comes out to about 56.
That's 56 bits of randomness.

Well, in an 8.3 filename, with no case sensitivity, and only using
numbers and letters, we get over 57 bits of randomness to play with,
using all 11 characters. No problem.

Now if we are restricted to ending the filenames with something like
.htm, then there are only about 41 bits of randomness, and then we
run about 1% risk of collision for a puny n=100,000 message archive.
That's pushing it.

Ok, one last note. If we use a real filesystem, with upper and lower
case letters in the filenames, we'd still need 10 characters in the
filename to meet/exceed the acceptable saftey margin (57 bits). So
those lower case letters don't help us much in the region we are
interested in.

Using MD-5 checksums for filenames is complete overkill statisticly
speaking. They are 128 bits, and would consume 20-odd characters in
the filename. 10 character filenames would do the trick nicely. There
is certainly no need to combine MD-5 and message-ID's from a
statistical standpoint.

Okay, sorry for the babbling! It's was the repressed student inside
me.

Jeff

<Prev in Thread] Current Thread [Next in Thread>