This is superb, but my statistics are rusty enough that I'd appreciate
your advice on how to interpret it. Specifically, isn't the key
statistic I want the bucket count for the most frequently occurring
octet, divided by the number of 900-octet lines in the file (that is,
the file size divided by 900)? As long as that number is relatively
low, shouldn't I be safe from exceeding the 998 limit?
Or, more specifically, Ned and I are now thinking of escaping the octets
NUL, CR, LF, TAB, SP, and "=" (the SP and TAB prevent potential
corruption on some buggy MTAs, and the "=" is used as the escaping
character). Would it be possible for you to tell me, across all runs of
900 octets in all of your files, what is the largest count of those 6
octets that you see? The average count (I believe) should be
6/256 * 900, or 21. But what's the max? If it's very close to 49,
then we should probably choose a line count less than 900.
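A minimal sketch of the measurement being asked for, assuming a file
fits in memory and that "run of 900 octets" means every 900-octet
sliding window, not just aligned blocks. The function name is
hypothetical.

```python
# The six octets proposed for escaping: NUL, CR, LF, TAB, SP and "=".
ESCAPED = frozenset(b"\x00\r\n\t =")


def max_escapable_in_window(data: bytes, window: int = 900) -> int:
    """Largest count of ESCAPED octets in any run of `window` octets."""
    flags = [1 if octet in ESCAPED else 0 for octet in data]
    if len(flags) <= window:
        return sum(flags)
    # Slide the window one octet at a time, updating the running count.
    current = sum(flags[:window])
    best = current
    for i in range(window, len(flags)):
        current += flags[i] - flags[i - window]
        if current > best:
            best = current
    return best
```

Taking the maximum of this value over all files gives the worst case to
compare against the available margin.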
The other alternative is to expect encoders to be a little smarter and
guarantee that no line exceeds 998 octets, rather than relying on
probability.
The encoder could process each octet one at a time, escape the six that
need to be, output the result while incrementing a counter by the number
of octets outputted (either 1 or 2), add a CRLF when the counter reaches
997, and then reset the counter.
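The counting encoder described above can be sketched as follows. The
message does not say how an escaped octet is represented after the "=",
so the offset of 64 used here is purely an assumption for illustration
(it keeps the second octet out of the escaped set); the names are
hypothetical too.

```python
ESCAPE = 0x3D                                   # "="
NEEDS_ESCAPE = frozenset(b"\x00\r\n\t =")       # NUL, CR, LF, TAB, SP, "="
OFFSET = 64          # assumed mapping for the escaped octet, not from the thread
LINE_TARGET = 997    # break the line once the counter reaches this


def encode(data: bytes) -> bytes:
    out = bytearray()
    counter = 0
    for octet in data:
        if octet in NEEDS_ESCAPE:
            out += bytes((ESCAPE, (octet + OFFSET) % 256))
            counter += 2
        else:
            out.append(octet)
            counter += 1
        # A 2-octet escape can take the counter from 996 to 998, so test >=.
        if counter >= LINE_TARGET:
            out += b"\r\n"
            counter = 0
    return bytes(out)


def decode(encoded: bytes) -> bytes:
    out = bytearray()
    # CR and LF are always escaped, so any CRLF left is a line break.
    it = iter(encoded.replace(b"\r\n", b""))
    for octet in it:
        if octet == ESCAPE:
            out.append((next(it) - OFFSET) % 256)
        else:
            out.append(octet)
    return bytes(out)
```

With this approach every full line is 997 or 998 octets long regardless
of the input, so no probabilistic margin is needed.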
BTW, does anyone know of NNTP or 8BitMIME ESMTP transport
implementations that do not support 998 octets per line?
Also, I've taken the liberty of cc'ing ietf-822, which is the best place
to discuss a new CTE.
- dan
--
Dan Kohn <mailto:dan(_at_)dankohn(_dot_)com>
<http://www.dankohn.com/> <tel:+1-650-327-2600>
-----Original Message-----
From: Lyndon Nerenberg [mailto:lyndon(_at_)orthanc(_dot_)ab(_dot_)ca]
Sent: Sunday, January 26, 2003 13:39
To: Dan Kohn
Subject: Re: gzip-8bit
900 is chosen as the unescaped octet stream length because RFC 2822
prohibits lines from exceeding 998 octets, plus the ending CRLF. Since the
number of octets that will be escaped is not known, 900 seemed to
provide a large amount of margin (i.e., room for 49 escaped octets).
Since each octet output by gzip should be approximately equally likely,
and there are 256 possibilities, there should be on average 4 / 256 *
900 = 14 octets per line that need to be escaped.
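For what it's worth, the same idealized assumption (every octet
independent and equally likely, which real gzip output may not satisfy)
lets one estimate how often a 900-octet line would exceed the margin,
using a binomial tail. This is only a sketch; the function names are
hypothetical.

```python
from math import exp, lgamma, log


def log_binom_pmf(n: int, p: float, i: int) -> float:
    """Natural log of P(X == i) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p))


def binomial_tail(n: int, p: float, k: int) -> float:
    """P(X > k) for X ~ Binomial(n, p)."""
    return sum(exp(log_binom_pmf(n, p, i)) for i in range(k + 1, n + 1))


# Expected number of escaped octets per 900-octet line with 4 escaped values.
expected = 900 * 4 / 256

# Probability that a single line needs more than 49 escapes.
p_overflow = binomial_tail(900, 4 / 256, 49)
```

The per-line probability still has to be multiplied by the number of
lines in a message, and it says nothing about files whose octets are
not uniformly distributed.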
The octet frequency distribution in gzipped files isn't as flat as you
might imagine. I ran a histogram analysis against a selection of gzipped
files I had at hand to determine just how flat the distribution of
octets is. The results are below, formatted as follows:
Column 1: size of the file, in octets.
Column 2: the bucket count for the least frequently occurring octet.
Column 3: the bucket count for the most frequently occurring octet.
Column 4: the mean of the bucket counts.
Column 5: the standard deviation of the bucket counts.
Column 6: the standard deviation divided by the mean.
607 0.000000 9.000000 2.371094 1.697593 .71
655 0.000000 10.000000 2.558594 1.717467 .67
708 0.000000 9.000000 2.765625 1.695511 .61
899 0.000000 10.000000 3.511719 1.932418 .55
899 0.000000 10.000000 3.511719 1.932418 .55
899 0.000000 10.000000 3.511719 1.932418 .55
1179 0.000000 12.000000 4.605469 2.244048 .48
1211 0.000000 11.000000 4.730469 2.141847 .45
1511 1.000000 16.000000 5.902344 2.398371 .40
1660 1.000000 15.000000 6.484375 2.666295 .41
2815 4.000000 22.000000 10.996094 3.487139 .31
3357 4.000000 23.000000 13.113281 3.710055 .28
8573 19.000000 53.000000 33.488281 5.825578 .17
9380 23.000000 64.000000 36.640625 6.731148 .18
9435 17.000000 65.000000 36.855469 6.604820 .17
9435 17.000000 65.000000 36.855469 6.604820 .17
16366 42.000000 87.000000 63.929688 8.777798 .13
20868 58.000000 121.000000 81.515625 9.744377 .11
21003 54.000000 145.000000 82.042969 10.826134 .13
21003 54.000000 145.000000 82.042969 10.826134 .13
21804 49.000000 169.000000 85.171875 12.860495 .15
21804 49.000000 169.000000 85.171875 12.860495 .15
23601 70.000000 125.000000 92.191406 10.737933 .11
31060 93.000000 160.000000 121.328125 13.064210 .10
34068 96.000000 173.000000 133.078125 13.709662 .10
50087 127.000000 246.000000 195.652344 18.090923 .09
58335 177.000000 298.000000 227.871094 18.606613 .08
58885 175.000000 285.000000 230.019531 17.314971 .07
70085 218.000000 334.000000 273.769531 22.023372 .08
80865 246.000000 381.000000 315.878906 22.972066 .07
113342 381.000000 509.000000 442.742188 26.927781 .06
115182 385.000000 633.000000 449.929688 31.520872 .07
152701 503.000000 684.000000 596.488281 30.160506 .05
152701 503.000000 684.000000 596.488281 30.160506 .05
188502 663.000000 891.000000 736.335938 36.784375 .04
198948 686.000000 993.000000 777.140625 36.907027 .04
213189 680.000000 1037.000000 832.769531 58.903702 .07
247590 831.000000 1261.000000 967.148438 47.645188 .04
288760 913.000000 1256.000000 1127.968750 55.693418 .04
311240 884.000000 1667.000000 1215.781250 112.340716 .09
311240 884.000000 1667.000000 1215.781250 112.340716 .09
377272 1307.000000 1602.000000 1473.718750 55.318317 .03
388020 1358.000000 2170.000000 1515.703125 73.552740 .04
485364 1713.000000 2170.000000 1895.953125 75.495536 .03
517284 1736.000000 2366.000000 2020.640625 94.305706 .04
538505 1893.000000 2489.000000 2103.535156 82.821997 .03
560353 1989.000000 2648.000000 2188.878906 85.107268 .03
560353 1989.000000 2648.000000 2188.878906 85.107268 .03
750234 2357.000000 3343.000000 2930.601562 194.399168 .06
872048 2785.000000 4867.000000 3406.437500 401.799473 .11
879673 2986.000000 3998.000000 3436.222656 135.385421 .03
963682 3087.000000 5098.000000 3764.382812 354.615937 .09
997920 2281.000000 7349.000000 3898.125000 461.347009 .11
1046770 3596.000000 5469.000000 4088.945312 284.685800 .06
1112976 3989.000000 7686.000000 4347.562500 254.226423 .05
1327104 3862.000000 6749.000000 5184.000000 468.322080 .09
1687669 5924.000000 7186.000000 6592.457031 204.379086 .03
1873351 6614.000000 8776.000000 7317.777344 261.643008 .03
2072772 7342.000000 10161.000000 8096.765625 305.602913 .03
2158683 7677.000000 10622.000000 8432.355469 301.601096 .03
2564291 9426.000000 11323.000000 10016.761719 281.364730 .02
3348820 9985.000000 16478.000000 13081.328125 1147.182354 .08
5772632 20494.000000 29802.000000 22549.343750 902.669842 .04
27407781 103531.000000 117263.000000 107061.644531 2093.466687 .01
34950972 124450.000000 209336.000000 136527.234375 6884.022945 .05
37495739 130774.000000 178224.000000 146467.730469 5398.963450 .03
42047988 144620.000000 201436.000000 164249.953125 9260.798167 .05
195211548 685275.000000 941261.000000 762545.109375 37221.965966 .04
224484882 787278.000000 1017605.000000 876894.070312 34405.892708 .03
596657226 2131931.000000 2887000.000000 2330692.289062 89687.773147 .03
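For anyone wanting to repeat the measurement on their own files, the six
columns could be computed roughly as below. This is a sketch, not the
script that produced the table: the function name is hypothetical, and
the population standard deviation is an assumption (the message does not
say which variant was used).

```python
from statistics import mean, pstdev


def octet_stats(data: bytes):
    """Return (size, min count, max count, mean, std dev, std dev / mean)."""
    counts = [0] * 256          # one bucket per possible octet value
    for octet in data:
        counts[octet] += 1
    mu = mean(counts)           # always len(data) / 256
    sigma = pstdev(counts)      # population standard deviation (assumed)
    ratio = sigma / mu if mu else 0.0
    return len(data), min(counts), max(counts), mu, sigma, ratio
```

Calling octet_stats() on the raw contents of each gzipped file, read in
binary mode, yields one row per file.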