Re: gzip/deflate compression/encoding


On Sun July 17 2005 05:22, Adam M. Costello wrote:


Bruce Lilly <blilly(_at_)erols(_dot_)com> wrote:

On Fri July 1 2005 18:33, ned+ietf-822(_at_)mrochek(_dot_)com wrote:

It is relatively easy to design a scheme that limits the
overhead to 1-2% no matter what the input.


Maybe, depending on the constraints.  The minimum constraints on the
output are:
o CRLF only for line endings, no lone CR or lone LF
o no NUL
o line length <= 998 octets
(for that is the definition of 8bit).  The input of course is an
unconstrained sequence of octets.

About 1.4% expansion should be possible with only those constraints,
a fairly simple algorithm (faster decode than encode), and moderate
encoder memory requirements (an 84 octet input buffer).


I think we must have all had similar ideas in mind.  I finally got
around to specifying (informally) and implementing (as a demo) my idea,
see below.


FYI, I based the numbers above on:
o elimination of N=3 octets (0x00, 0x0A, 0x0D) from the output sequence
  leaves 253=(256-N) valid output codes
o output consists of an addition code (1 octet) followed by a block of
  modified output octets (each octet is input value plus addition code,
  modulo 256).  The addition code is selected for a block such that the
  taboo output codes are not produced.  Obviously, addition codes cannot
  assume taboo values (in this case, 0, 10, and 13 (decimal))
o a buffer (block) length of M=(256-N)/N is the maximum length that is
  guaranteed to work for N and sets a lower bound on the expansion factor.
  At N=3, M=84
o expansion is (M+1)/M*1000/998, which is broken down as M+1 output
  octets per M input octets and CRLF insertion after 998 output octets
  (decoder ignores CR, LF). For N=3, M=84, expansion is 1.39% (n=4,
  M=62 yields expansion of 1.82%, larger N with this method yields >2%
  expansion).  Expansion is independent of input octet sequence.
o to simplify decoding, block length is fixed (not variable) and requires
  no buffering (of course the decoder needs to keep track of the data
  count up to the buffer length and needs to save and apply the offset
  code to each octet)