gzip-8bit


Juergen Helbing wrote:

And because it seems necessary to re-specify yEnc entirely before it
is used as an official MIME encoding this might be the right time to
learn from the past and to do it better now.


Here's some procrastination (a gzip-8bit CTE) from my procrastination (a
new Usefor draft) from my real work.  As always, comments are
appreciated.


Gzip-8bit Content-Transfer-Encoding

This document specifies a new standards-track Content-Transfer-Encoding
(CTE), designed mainly for use within the Usenet environment.  It
combines gzip [RFC 1952] compression (which uses RFC 1951 deflate) with
conversion from binary to 8bit, to create an encoding that is both
compact and can successfully transit nearly 100% of the Usenet
infrastructure.

Why a new CTE?

RFC 2045 says:

   Unlike media types and subtypes, the creation of new Content-
   Transfer-Encoding values is STRONGLY discouraged, as it seems likely
   to hinder interoperability with little potential benefit.

Obviously, gzip-8bit needs to overcome a high hurdle to justify
standardization.  The basic argument for a new CTE is that a significant
portion of Usenet traffic consists of posting binary content, and that
users are currently manually encoding that content with one or more
non-standardized schemes.  In particular, the adoption of [yEnc]
<http://www.yenc.com> shows the interest by implementers and end-users
in a format that does not have the 33% overhead over binary of base64.

It is believed that a standards-based replacement for yEnc could learn
from its technical and implementation experience, and provide a better
end-user service without causing any significant interoperability
failures.  (I.e., receivers will need to support the new CTE, but then
they also had to or have to learn to support yEnc.)

Why gzip?

Although significant portions of posted content (including many image,
video, and audio files) are already in a compressed format and so will
not benefit directly from gzip compression, gzip still provides
benefits.  Specifically, deflate collapses runs of repeated octets,
which removes the bias toward excessive NULs that otherwise exists in
many binary files.  As a result, gzip output is far more evenly
distributed among the 256 possible octets.
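
To illustrate the point about NULs, here is a rough Python sketch that
compares the share of NUL octets before and after gzip.  The sample
data (small 32-bit little-endian integers) and the helper name
nul_fraction are my own assumptions for illustration, not part of the
proposal.

```python
import gzip
import struct


def nul_fraction(data: bytes) -> float:
    """Return the share of octets in data that are NUL (0x00)."""
    return data.count(0) / len(data)


# Hypothetical "binary file": 20000 small integers packed as 32-bit
# little-endian values, so roughly half of the octets are NUL.
sample = b"".join(struct.pack("<I", i) for i in range(20000))
compressed = gzip.compress(sample)

print("NUL share before gzip:", nul_fraction(sample))
print("NUL share after gzip: ", nul_fraction(compressed))
```

The exact figures depend on the input and on the zlib version, so none
are quoted here; the share after compression should be far lower.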

For text and many executable files, gzip can provide significant
lossless compression, often of 50% or more.  The algorithm used is
fully described in RFCs 1951 and 1952, and is amenable for use as a
Unix-style software filter.  It is believed to be unencumbered by
patents.  Software implementing gzip is widely available, including in
open source libraries.

Although it may seem inefficient to run gzip on content that may not be
especially compressible, the processing cost and latency of doing so are
insignificant on modern hardware.  It is believed that creating a
single, new CTE that always compresses with gzip (and then always
encodes in 8bit), is far better than creating multiple, optional
variations.

Why 8bit?

Nearly the entire Usenet infrastructure (user agents, injectors, and
servers) is 8bit-clean, but not binary-clean.  Quoting RFC 2049:

    (2)   Many systems may elect to represent and store text data
          using local newline conventions.  Local newline
          conventions may not match the RFC822 CRLF convention --
          systems are known that use plain CR, plain LF, CRLF, or
          counted records.  The result is that isolated CR and LF
          characters are not well tolerated in general; they may
          be lost or converted to delimiters on some systems, and
          hence must not be relied on.

    (3)   The transmission of NULs (US-ASCII value 0) is
          problematic in Internet mail.  (This is largely the
          result of NULs being used as a termination character by
          many of the standard runtime library routines in the C
          programming language.) The practice of using NULs as
          termination characters is so entrenched now that
          messages should not rely on them being preserved.

What content can it be used for?

As with all CTEs, the gzip-8bit CTE can be used with any media type.
Content in the text top-level media type must be canonicalized per RFC
2049 before being encoded.

What about splitting files?

Gzip-8bit is NOT compatible with message/partial, which requires 7bit
Content-Transfer-Encoding.  A new MIME type such as
application/news-partial could be specified, but it is believed that the
disadvantages of doing so outweigh the benefits.

The gzip-8bit CTE

What about gateways?

RFC 2045 explicitly prohibits multiple encodings, so it is not
acceptable for a gateway from news to mail to simply further encode
gzip-8bit in base64.  Instead, such a gateway could decode the gzip-8bit
CTE and then encode in base64.  Alternatively, if the destination MTA or
MUA supports 8BitMIME (RFC 1652) or BinaryMIME (RFC 3030), the gateway
can attempt to forward the message without modifying the CTE.  However,
due to the hop-by-hop nature of email, such a message could encounter an
MTA that only supports 7bit transport, which would have no choice but to
reject the message.
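
As a rough sketch of the first option (decode gzip-8bit, then encode in
base64), a gateway could do something like the following in Python.
The function name regateway_to_base64 is hypothetical, and the sketch
only handles the body; a real gateway would also have to rewrite the
Content-Transfer-Encoding header field to "base64".

```python
import base64
import gzip


def regateway_to_base64(gzip8bit_body: bytes) -> bytes:
    """Re-encode a gzip-8bit body as base64 (hypothetical helper)."""
    # Line breaks carry no data in gzip-8bit, so drop them first.
    stream = gzip8bit_body.replace(b"\r", b"").replace(b"\n", b"")
    # Undo the four escapes; "=D" goes last so that a restored "="
    # is never mistaken for the start of another escape.
    for code, octet in ((b"=A", b"\x00"), (b"=B", b"\r"),
                        (b"=C", b"\n"), (b"=D", b"=")):
        stream = stream.replace(code, octet)
    content = gzip.decompress(stream)  # also verifies the CRC-32
    # encodebytes() emits 76-character lines ending in LF; use CRLF.
    return base64.encodebytes(content).replace(b"\n", b"\r\n")
```

Note that the output is the original content in base64, not the gzip
stream in base64, so no double encoding is involved.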

What are the steps for encoding?

Textual content is converted to canonical format, as described in RFC
2049.  gzip encoding is applied to the content as described in RFCs 1951
& 1952.  The CRC-32 value MUST be set.  The FNAME field MUST NOT be
set, as Content-Disposition [RFC 2183] used with [RFC 2231] provides a
standards-based way of providing this information that also supports
internationalization.  More fundamentally, it would be a layer
violation, because a filename is an attribute of a media type, not a
CTE.

The output of that compression is processed 900 octets at a time.  For
each 900 octets, NUL is replaced with "=A", CR with "=B", LF with "=C",
and "=" with "=D".  The octet stream is then appended with CRLF, and the
next 900 octets are processed the same way.
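
The encoding steps above can be sketched in Python roughly as follows.
This is only an illustration under my own assumptions: the names
gzip_8bit_encode, ESCAPES and CHUNK are hypothetical, the input is
assumed to be already canonicalized, and I rely on gzip.compress()
writing no FNAME field and always including the CRC-32.

```python
import gzip
import re

# The four escapes from the text above.
ESCAPES = {b"\x00": b"=A", b"\r": b"=B", b"\n": b"=C", b"=": b"=D"}
CHUNK = 900  # unescaped octets per output line


def gzip_8bit_encode(content: bytes) -> bytes:
    """Compress content with gzip, then escape it 900 octets at a time."""
    compressed = gzip.compress(content)
    lines = []
    for start in range(0, len(compressed), CHUNK):
        chunk = compressed[start:start + CHUNK]
        # One pass over the chunk, so "=" introduced by an escape is
        # never escaped a second time.
        escaped = re.sub(rb"[\x00\r\n=]",
                         lambda m: ESCAPES[m.group()], chunk)
        lines.append(escaped + b"\r\n")
    return b"".join(lines)


encoded = gzip_8bit_encode(b"Hello, Usenet!\r\n" * 1000)
```

The sketch does not handle the rare chunk whose escaped form would
exceed 998 octets; a real sender would have to check for that.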

900 is chosen as the unescaped octet stream length because RFC 2822
prohibits lines from exceeding 998 octets, excluding the ending CRLF.
Since the number of octets that will be escaped is not known in advance,
900 seemed to provide a large amount of margin (i.e., room for 98
escaped octets, as each escape adds one octet to the line).
Since each octet output by gzip should be approximately equally likely,
and there are 256 possibilities, there should be on average 4 / 256 *
900 = 14 octets per line that need to be escaped.
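
The arithmetic can be checked with a few lines of Python.  The constant
names are my own, and the calculation assumes, as the text does, that
every octet value is equally likely in gzip output.  Each escape
replaces one octet with two, so it lengthens the line by one octet.

```python
CHUNK = 900         # unescaped octets per line
MAX_LINE = 998      # RFC 2822 limit, excluding CRLF
ESCAPED_VALUES = 4  # NUL, CR, LF and "="

# Expected number of escapes in one 900-octet chunk.
expected_escapes = ESCAPED_VALUES / 256 * CHUNK

# Each escape adds one octet, so this is the average line length.
expected_line_length = CHUNK + expected_escapes

# Largest number of escapes that still fits within the limit.
max_escapes_that_fit = MAX_LINE - CHUNK

print(expected_escapes)      # 14.0625
print(expected_line_length)  # 914.0625
print(max_escapes_that_fit)  # 98
```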

MIME sending agents MUST NOT create line lengths of more than 998 octets
(plus the CRLF).  MIME receiving agents SHOULD accept gzip-8bit content
of any line length.

What are the steps for decoding?

CR and LF characters are removed from the input stream.  The 4 escaped
octets are replaced with their unescaped translations.  The resulting
stream is decompressed as described in RFCs 1951 & 1952, and the CRC-32
value MUST be checked.  If the content is in the text top-level media
type, line endings are converted to local form.
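
The decoding steps above can be sketched in Python roughly as follows.
Again this is only an illustration: the names gzip_8bit_decode,
UNESCAPES and _unescape are hypothetical, rejecting an unknown escape
with ValueError is my own choice, and conversion of text line endings
to local form is left out.  gzip.decompress() checks the CRC-32 and
raises an exception on a mismatch.

```python
import gzip
import re

# Escape code (the octet after "=") to the octet it stands for.
UNESCAPES = {b"A": b"\x00", b"B": b"\r", b"C": b"\n", b"D": b"="}


def _unescape(match):
    """Translate one "=X" escape, rejecting unknown codes."""
    code = match.group(1)
    if code not in UNESCAPES:
        raise ValueError("invalid gzip-8bit escape: %r" % match.group())
    return UNESCAPES[code]


def gzip_8bit_decode(encoded: bytes) -> bytes:
    """Strip line breaks, undo the escapes, then gunzip."""
    # CR and LF never carry data in gzip-8bit, so remove them all.
    stream = encoded.replace(b"\r", b"").replace(b"\n", b"")
    # A single left-to-right pass; "=" is always an escape introducer.
    compressed = re.sub(rb"=(.?)", _unescape, stream, flags=re.DOTALL)
    return gzip.decompress(compressed)
```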

Where's the registration?

RFC 2048 requires the following information for a new CTE:

Naming Requirements

The name "gzip-8bit" conforms to the syntax of the CTE header field.

Algorithm Specification Requirements

The algorithm used is fully described in RFC 1952 and in this document.
It is believed to be unencumbered by patents.  Software implementing gzip
is widely available, including in open source libraries.

Input Domain Requirements

Gzip-8bit is applicable to an arbitrary sequence of octets of any length.

Output Range Requirements

Gzip-8bit is explicitly designed to always output 8bit data.

Data Integrity and Generality Requirements

Gzip-8bit is fully invertible on any platform.

New Functionality Requirements

Gzip-8bit requires approximately 25% fewer bytes than base64 to
represent most binary objects, since it avoids base64's 33% overhead
over binary.  For textual and much executable content, it can
often create a 50% or more reduction in size.


          - dan
--
Dan Kohn <mailto:dan(_at_)dankohn(_dot_)com>
<http://www.dankohn.com/>  <tel:+1-650-327-2600>