ietf-822

RE: gzip-8bit

2003-02-26 16:04:37

This is superb, but my statistics are rusty enough that I'd appreciate
your advice on how to interpret it.  Specifically, isn't the key
statistic I want the bucket count for the most frequently occurring
octet divided by the number of 900-octet runs in the file (that is,
the file size divided by 900)?  As long as that number is relatively
low, shouldn't I be safe from exceeding the 998 limit?
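
As a rough illustration of that statistic, here is a minimal Python
sketch.  It assumes the intended reading is "max bucket count divided by
(file size / 900)"; the function name peak_per_run is made up for this
example, and the sample rows are copied from the table further down.

```python
RUN = 900  # unescaped octets per line, as discussed in this thread


def peak_per_run(file_size: int, max_bucket: int, run: int = RUN) -> float:
    """Average occurrences of the most frequent octet per `run` octets."""
    return max_bucket / (file_size / run)


# (file size, max bucket count) pairs copied from the table in this message.
rows = [(997920, 7349), (1112976, 7686), (596657226, 2887000)]

for size, max_bucket in rows:
    # A perfectly flat distribution would give RUN / 256 per octet value.
    print(size, round(peak_per_run(size, max_bucket), 2),
          "flat would be", round(RUN / 256, 2))
```

Note that this is an average over the whole file, so it says nothing
about the worst single run of 900 octets.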

Or, more specifically, Ned and I are now thinking of escaping the octets
NUL, CR, LF, TAB, SP, and "=" (the SP and TAB prevent potential
corruption on some buggy MTAs, and the "=" is used as the escaping
character).  Would it be possible for you to tell me, across all runs of
900 octets in all of your files, what the largest count of those 6
octets is?  The average count (I believe) should be 6/256 * 900, or
about 21.  But what's the max?  If it's very close to 49, then we
should probably choose a line length less than 900.
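
A minimal sketch of how that maximum could be measured, assuming "all
runs of 900 octets" means every sliding window of 900 consecutive octets
in a file.  The function name and the sliding-window reading are my
assumptions, not something anyone has run yet.

```python
# The six octets proposed for escaping: NUL, CR, LF, TAB, SP, "=".
ESCAPED = frozenset(b"\x00\r\n\t =")


def max_escaped_in_window(data: bytes, window: int = 900) -> int:
    """Largest count of to-be-escaped octets in any run of `window` octets.

    If the data is shorter than the window, the whole input is one run.
    """
    flags = [1 if b in ESCAPED else 0 for b in data]
    current = sum(flags[:window])  # count in the first window
    best = current
    for i in range(window, len(flags)):
        # Slide the window one octet: add the new octet, drop the oldest.
        current += flags[i] - flags[i - window]
        best = max(best, current)
    return best
```

Feeding it the raw contents of each gzipped file (opened in binary mode)
and taking the largest result over all files would give the number I'm
asking about.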

The other alternative is to expect encoders to be a little smarter and
guarantee the 998-octet line limit rather than leaving it probabilistic.
The encoder could process each octet one at a time, escape the six that
need it, output the result while incrementing a counter by the number
of octets output (either 1 or 2), add a CRLF once the counter reaches
997 (or 998, after a two-octet escape), and then reset the counter.
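
Roughly, in Python.  This is only a sketch: the two-octet escape form
used here ("=" followed by the octet plus 64, modulo 256) is a
placeholder I picked so the example runs, since we have not settled on
an actual escape mapping.  The decoder is included only to check the
round trip.

```python
ESCAPE = 0x3D  # "="
NEEDS_ESCAPE = frozenset(b"\x00\r\n\t =")  # NUL, CR, LF, TAB, SP, "="
BREAK_AT = 997  # emit CRLF once the counter reaches this value


def encode(data: bytes) -> bytes:
    out = bytearray()
    count = 0  # octets written on the current line
    for b in data:
        if b in NEEDS_ESCAPE:
            # Hypothetical mapping: "=" then (octet + 64) mod 256.
            out += bytes((ESCAPE, (b + 64) % 256))
            count += 2
        else:
            out.append(b)
            count += 1
        if count >= BREAK_AT:  # 997, or 998 after a two-octet escape
            out += b"\r\n"
            count = 0
    return bytes(out)


def decode(encoded: bytes) -> bytes:
    out = bytearray()
    # Raw CR and LF are always escaped, so any CRLF is a line break.
    octets = iter(encoded.replace(b"\r\n", b""))
    for b in octets:
        if b == ESCAPE:
            out.append((next(octets) - 64) % 256)
        else:
            out.append(b)
    return bytes(out)
```

With this approach no line exceeds 998 octets regardless of how the
escaped octets are distributed, at the cost of lines carrying a variable
number of source octets.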

BTW, does anyone know of NNTP or 8BitMIME ESMTP transport
implementations that do not support 998 octets per line?

Also, I've taken the liberty of cc'ing ietf-822, which is the best place
to discuss a new CTE.

          - dan
--
Dan Kohn <mailto:dan(_at_)dankohn(_dot_)com>
<http://www.dankohn.com/>  <tel:+1-650-327-2600> 

-----Original Message-----
From: Lyndon Nerenberg [mailto:lyndon(_at_)orthanc(_dot_)ab(_dot_)ca] 
Sent: Sunday, January 26, 2003 13:39
To: Dan Kohn
Subject: Re: gzip-8bit 


900 is chosen as the unescaped octet stream length because RFC 2822
prohibits lines from exceeding 998 octets, plus the ending CRLF.  Since
the number of octets that will be escaped is not known, 900 seemed to
provide a large amount of margin (i.e., room for 49 escaped octets).
Since each octet output by gzip should be approximately equally likely,
and there are 256 possibilities, there should be on average 4 / 256 *
900 = approximately 14 octets per line that need to be escaped.
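
For reference, the averages quoted above and the chance of overrunning
the margin can be sketched as follows.  This assumes every octet is
independent and uniformly distributed over 256 values, which is exactly
the assumption being tested below, so treat the tail probability as an
idealized figure.  The function names are made up for this example.

```python
from math import comb


def expected_escapes(n_octets: int, n_escaped_values: int) -> float:
    """Mean number of escaped octets per line under a flat distribution."""
    return n_escaped_values / 256 * n_octets


def prob_more_than(n_octets: int, n_escaped_values: int, limit: int) -> float:
    """Binomial probability that more than `limit` octets need escaping."""
    p = n_escaped_values / 256
    return sum(
        comb(n_octets, k) * p**k * (1 - p) ** (n_octets - k)
        for k in range(limit + 1, n_octets + 1)
    )


print(expected_escapes(900, 4))    # four escaped values
print(expected_escapes(900, 6))    # six escaped values
print(prob_more_than(900, 6, 49))  # chance of exceeding the 49-octet margin
```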

The octet frequency distribution in gzipped files isn't as flat as you
might imagine. I ran a histogram analysis against a selection of gzipped
files I had at hand to determine just how flat the distribution of
octets is. The results are below, formatted as follows:

Column 1:  size of the file, in octets.
Column 2:  the bucket count for the least frequently occurring octet.
Column 3:  the bucket count for the most frequently occurring octet.
Column 4:  the mean of the bucket counts.
Column 5:  the standard deviation of the bucket counts.
Column 6:  the standard deviation divided by the mean.

     size             min             max            mean        stddev  sd/mean
      607        0.000000        9.000000        2.371094      1.697593      .71
      655        0.000000       10.000000        2.558594      1.717467      .67
      708        0.000000        9.000000        2.765625      1.695511      .61
      899        0.000000       10.000000        3.511719      1.932418      .55
      899        0.000000       10.000000        3.511719      1.932418      .55
      899        0.000000       10.000000        3.511719      1.932418      .55
     1179        0.000000       12.000000        4.605469      2.244048      .48
     1211        0.000000       11.000000        4.730469      2.141847      .45
     1511        1.000000       16.000000        5.902344      2.398371      .40
     1660        1.000000       15.000000        6.484375      2.666295      .41
     2815        4.000000       22.000000       10.996094      3.487139      .31
     3357        4.000000       23.000000       13.113281      3.710055      .28
     8573       19.000000       53.000000       33.488281      5.825578      .17
     9380       23.000000       64.000000       36.640625      6.731148      .18
     9435       17.000000       65.000000       36.855469      6.604820      .17
     9435       17.000000       65.000000       36.855469      6.604820      .17
    16366       42.000000       87.000000       63.929688      8.777798      .13
    20868       58.000000      121.000000       81.515625      9.744377      .11
    21003       54.000000      145.000000       82.042969     10.826134      .13
    21003       54.000000      145.000000       82.042969     10.826134      .13
    21804       49.000000      169.000000       85.171875     12.860495      .15
    21804       49.000000      169.000000       85.171875     12.860495      .15
    23601       70.000000      125.000000       92.191406     10.737933      .11
    31060       93.000000      160.000000      121.328125     13.064210      .10
    34068       96.000000      173.000000      133.078125     13.709662      .10
    50087      127.000000      246.000000      195.652344     18.090923      .09
    58335      177.000000      298.000000      227.871094     18.606613      .08
    58885      175.000000      285.000000      230.019531     17.314971      .07
    70085      218.000000      334.000000      273.769531     22.023372      .08
    80865      246.000000      381.000000      315.878906     22.972066      .07
   113342      381.000000      509.000000      442.742188     26.927781      .06
   115182      385.000000      633.000000      449.929688     31.520872      .07
   152701      503.000000      684.000000      596.488281     30.160506      .05
   152701      503.000000      684.000000      596.488281     30.160506      .05
   188502      663.000000      891.000000      736.335938     36.784375      .04
   198948      686.000000      993.000000      777.140625     36.907027      .04
   213189      680.000000     1037.000000      832.769531     58.903702      .07
   247590      831.000000     1261.000000      967.148438     47.645188      .04
   288760      913.000000     1256.000000     1127.968750     55.693418      .04
   311240      884.000000     1667.000000     1215.781250    112.340716      .09
   311240      884.000000     1667.000000     1215.781250    112.340716      .09
   377272     1307.000000     1602.000000     1473.718750     55.318317      .03
   388020     1358.000000     2170.000000     1515.703125     73.552740      .04
   485364     1713.000000     2170.000000     1895.953125     75.495536      .03
   517284     1736.000000     2366.000000     2020.640625     94.305706      .04
   538505     1893.000000     2489.000000     2103.535156     82.821997      .03
   560353     1989.000000     2648.000000     2188.878906     85.107268      .03
   560353     1989.000000     2648.000000     2188.878906     85.107268      .03
   750234     2357.000000     3343.000000     2930.601562    194.399168      .06
   872048     2785.000000     4867.000000     3406.437500    401.799473      .11
   879673     2986.000000     3998.000000     3436.222656    135.385421      .03
   963682     3087.000000     5098.000000     3764.382812    354.615937      .09
   997920     2281.000000     7349.000000     3898.125000    461.347009      .11
  1046770     3596.000000     5469.000000     4088.945312    284.685800      .06
  1112976     3989.000000     7686.000000     4347.562500    254.226423      .05
  1327104     3862.000000     6749.000000     5184.000000    468.322080      .09
  1687669     5924.000000     7186.000000     6592.457031    204.379086      .03
  1873351     6614.000000     8776.000000     7317.777344    261.643008      .03
  2072772     7342.000000    10161.000000     8096.765625    305.602913      .03
  2158683     7677.000000    10622.000000     8432.355469    301.601096      .03
  2564291     9426.000000    11323.000000    10016.761719    281.364730      .02
  3348820     9985.000000    16478.000000    13081.328125   1147.182354      .08
  5772632    20494.000000    29802.000000    22549.343750    902.669842      .04
 27407781   103531.000000   117263.000000   107061.644531   2093.466687      .01
 34950972   124450.000000   209336.000000   136527.234375   6884.022945      .05
 37495739   130774.000000   178224.000000   146467.730469   5398.963450      .03
 42047988   144620.000000   201436.000000   164249.953125   9260.798167      .05
195211548   685275.000000   941261.000000   762545.109375  37221.965966      .04
224484882   787278.000000  1017605.000000   876894.070312  34405.892708      .03
596657226  2131931.000000  2887000.000000  2330692.289062  89687.773147      .03
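
For anyone who wants to reproduce figures like these on their own files,
here is a minimal Python sketch of the six columns.  It is not the tool
used to produce the table above; in particular, the use of the
population standard deviation is an assumption, and the function name
octet_stats is invented for this example.

```python
from statistics import pstdev


def octet_stats(data: bytes):
    """Return (size, min, max, mean, stddev, stddev/mean) of the 256 buckets."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1  # one bucket per possible octet value
    mean = len(data) / 256
    sd = pstdev(counts)  # population standard deviation (assumed)
    ratio = sd / mean if mean else 0.0
    return len(data), min(counts), max(counts), mean, sd, ratio


# Example: read a gzipped file as raw octets and print one table row.
# with open("example.gz", "rb") as f:   # hypothetical file name
#     print("%d\t%f\t%f\t%f\t%f\t%.2f" % octet_stats(f.read()))
```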

