Re: gzip/deflate compression/encoding


Bruce Lilly <blilly(_at_)erols(_dot_)com> wrote:

On Fri July 1 2005 18:33, ned+ietf-822(_at_)mrochek(_dot_)com wrote:

It is relatively easy to design a scheme that limits the
overhead to 1-2% no matter what the input.


Maybe, depending on the constraints.  The minimum constraints on the
output are:
o CRLF only for line endings, no lone CR or lone LF
o no NUL
o line length <= 998 octets
(for that is the definition of 8bit).  The input of course is an
unconstrained sequence of octets.

About 1.4% expansion should be possible with only those constraints,
a fairly simple algorithm (faster decode than encode), and moderate
encoder memory requirements (an 84 octet input buffer).


I think we must have all had similar ideas in mind.  I finally got
around to specifying (informally) and implementing (as a demo) my idea,
see below.

AMC

/*
base253.c 0.0.0 (2005-Jul-17-Sun)
Adam M. Costello
http://www.nicemice.net/amc/

Base253 is a proposed MIME content-transfer-encoding that transforms an
arbitrary sequence of bytes (octets) into 8bit data, which is defined in
RFC-2045 as lines ending with the two bytes CR LF (13 10) containing at
most 998 other bytes, none of which are NUL (0), CR (13), or LF (10).

Base253 is designed to be compact (for both typical input and worst-case
input), simple to encode, even simpler to decode, and simple to explain
and understand.

This implementation expands the data by a theoretical maximum ratio of
935/920 = 1.63% (if I have correctly imagined the worst-case input), but
in tests on JPEG, MP3, and gzip files the actual expansion was 0.65% to
0.67%.

This implementation is a demo intended to illustrate the algorithms and
enable measurements of the output.  A practical implementation would do
more error checking and would probably have a different interface.

Specification:

In the encoded stream, each line is a sequence of long blocks, short
blocks, and filler bytes, in any order, followed by the line break.
Decoders ignore filler bytes and line breaks, and view an encoded
stream as a single sequence of blocks.  A filler byte is byte 15, which
can appear in the middle of a line, but is intended to be inserted
at the end of lines that would otherwise end with whitespace (which
some MTAs might strip off), and at the beginning of lines that would
otherwise begin with two hyphens (which might accidentally match MIME
part boundaries).  A long block is a header byte k in 1..9 followed
by 80k data bytes.  A short block is a header byte 16+80i+n (where i
is in 0..2, and n is in 0..79) followed by n data bytes.  Every block
(long or short) except the very first block has an additional data byte
implicitly prepended to its data if the previous block was a short
block.  The implicit byte is 0, 10, or 13, depending on whether the i
value of the previous block was 0, 1, or 2, respectively.  The meaning
of the encoded stream is the sequence of all data bytes (both explicit
and implicit).

Rationale:

A block cannot span a line break.  If it could, then either it would
have to include the line break as literal data (in which case the
common practice of storing messages using local line-break conventions
would corrupt the data or require decoders to convert back to the wire
representation) or else the line breaks would have to be ignored even
inside blocks (in which case decoders would have to inspect data bytes,
whereas non-validating decoders of the format described above can
blindly copy data bytes without inspecting them).

The bytes 11, 12, and 14 are not used except as data bytes.  The range
of n or k could be increased by using these bytes values (and either
range could be further increased at the expense of the other).  The
ideal allocation might be 0..81 for n and 1..6 for k (plus 1 filler byte
and the 3 prohibited bytes), but any such changes would make the ranges
non-contiguous, which would complicate the encoding.  The improvement in
compactness would be extremely slight, so it doesn't seem worth it.

An unintended fortuitous consequence of the formula 16+80i+n is that the
constants 16 and 80 are both divisible by 16, which makes it easier to
debug the output by looking at a hex dump.

This is ANSI C (C89) code.

*/


#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>


/* non-validating non-error-checking decoder */

static void decode(FILE *in, FILE *out)
{
  static const int banned[3] = {0,10,13};
  int previ, byte, nplus80i, i, n;
  unsigned char buf[80*9];

  /*

  The decoder doesn't actually need any lookahead at all, but it's
  simpler and more efficient to use a buffer to copy whole blocks rather
  than to copy a byte at a time.

  previ >= 0 means banned[previ] will be inserted before the next block
  (if there is one); previ < 0 means nothing will be inserted.

  n >= 0 means the current block contains n bytes; n < 0 means there is 
  no current block, just a line break, filler byte, or invalid byte.    

  Since this is a non-validating decoder, we are free to treat invalid
  bytes any way we like.  It's convenient to treat bytes 11, 12, 14,
  lone 10, and lone 13 like filler bytes, and to treat byte 0 like the
  header of a zero-length long block.

  */

  for (previ = -1;;) {
    byte = getc(in);
    if (byte == EOF) break;
    nplus80i = byte - 16;

    if (nplus80i >= 0) { /* short block */
      i = nplus80i / 80;
      n = nplus80i % 80;
    }
    else if (byte <= 9) { /* long block */
      i = -1;
      n = 80 * byte;
    }
    else continue; /* ignore filler or line break */

    if (previ >= 0) putc(banned[previ], out);
    fread(buf,n,1,in);
    fwrite(buf,n,1,out);
    previ = i;
  }
}


/* non-error-checking encoder */

static void encode(FILE *in, FILE *out)
{
  enum { linemax = 997 };
    /* maximum number of bytes in a line, excluding the line   */
    /* ending, which can be up to three bytes (filler, CR, LF) */

  int roomleft, m, byte, limbo, lastout, i, n, header;

  unsigned char buf[80*9];
    /* 80*9 bytes is the largest lookahead that is useful.  An    */
    /* encoder could use as little as 80 bytes of lookahead and   */
    /* still produce valid streams that are not quite as compact. */

  roomleft = linemax;  /* bytes allowed to be added to the current line */
  m = 0;               /* bytes in buf[] */
  byte = getc(in);     /* latest input byte */
  limbo = 0;           /* boolean: an implicit byte is waiting to hitch */
                       /* a ride on the next block */
  lastout = 10;        /* most recent byte written, initially LF to */
                       /* indicate the beginning of a line */

  for (;;) {
    /* Each iteration of this loop writes one block, */
    /* a header byte followed by the first n bytes   */
    /* of buf[] (header and n will be computed).     */

    /* First fill the buffer with data bytes (that is, */
    /* bytes other than the three forbidden ones):     */

    while (byte != EOF && byte != 0 && byte != 10 && byte != 13 && m < 80*9) {
      buf[m++] = byte;
      byte = getc(in);
    }

    if (m == 0 && byte == EOF && !limbo) break;  /* all done */

    if (m < 80) {  /* short block */
      if (byte == EOF) i = limbo = 0;
      else {
        if (byte == 0) i = 0;
        else if (byte == 10) i = 1;
        else assert(byte == 13), i = 2;
        /* Now that byte goes into limbo, and we read the next byte. */
        limbo = 1;
        byte = getc(in);
      }

      n = m;
      header = 16 + 80*i + n;
    }
    else {  /* long block */
      limbo = 0;
      n = m - m % 80;

      if (1+n > roomleft) {  /* block won't fit on current line */
        if (roomleft <= 357) {
          /* End the current line, increasing roomleft to more than n. */
          /* Insert a filler byte if the line would end in whitespace. */
          if (isspace(lastout)) putc(15,out);
          putc(13,out);
          putc(10,out);
          lastout = 10;
          roomleft = linemax;
        }
        else {
          /* Split the block, decreasing n to less than roomleft.    */
          /* The threshold of 357 is the point where the overhead of */
          /* splitting the block becomes outweighed by the better    */
          /* amortization of the trailing CRLF over a longer line.   */

          n = roomleft - 1;
          n = n - n % 80;
        }
      }

      header = n / 80;
      assert (header >= 1 && header <= 9);

      /* Insert a filler byte at the start of */
      /* a line that begins with two hyphens.  */

      if (lastout == 10 && header == 45 && buf[0] == 45) {
        /* It's possible that m is 0, in which case buf[0] would */
        /* be garbage, but in that case header would be 16 or 96 */
        /* or 176, not 45, so the && protects us.                */

        putc(15,out);
        --roomleft;
      }
    }

    /* Write the block. */

    putc(header,out);
    fwrite(buf,n,1,out);
    lastout = n == 0 ? header : buf[n-1];
    roomleft -= 1+n;

    /* Slide the buffer.  A circular buffer might be more */
    /* efficient, but sliding is simpler to implement.    */

    m -= n;
    if (m > 0) memmove(buf, buf + n, m);
  }

  /* End the last line if we're in the middle of it. */

  if (lastout != 10) {
    if (isspace(lastout)) putc(15,out);
    putc(13,out);
    putc(10,out);
  }
}


int main(int argc, char **argv)
{
  if ( argc != 2 ||
       (strcmp(argv[1], "-e") != 0 && strcmp(argv[1], "-d") != 0) ) {
    fprintf(stderr, "usage:\n%s -e\n%s -d\n", argv[0], argv[0]);
    return EXIT_FAILURE;
  }

  if (argv[1][1] == 'e') encode(stdin,stdout);
  else assert(argv[1][1] == 'd'), decode(stdin,stdout);
  return EXIT_SUCCESS;
}