ietf-822

Re: gzip/deflate compression/encoding

2005-06-28 21:16:35

On Tue June 28 2005 21:03, Laird Breyer wrote:

On Jun 28 2005, Bruce Lilly wrote:

1. The bottleneck might be elsewhere, known to the originator but
   unknown to the network operator.
2. That still implies RFC 3749 support at the operator's side.
3. It assumes that the end user knows how to, can, and will install
   TLS support including the RFC 3749 extension.

These are pretty standard objections, which apply to software updates,
whether for TLS support or MIME/CTE. They don't serve as a strong
differentiator in favour of either approach.

MIME transfer encoding is typically used end-to-end.  It doesn't matter
where the bottleneck is, and it doesn't require any special software
support at intermediate sites.
 
See RFC 1123 section 5.3.8.

Seems a little outdated ;-) 

Limits still exist. Now there's an extension (SIZE, RFC 1870) so that the
client can specify the message size before sending.  If the SMTP receiver
can't handle the message size, the options are to compress, fragment, or
bounce.  A bounce is a failure, fragmentation has several issues, and that
leaves compression.
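
To illustrate, here is a rough Python sketch of a sender checking the
advertised SIZE limit before attempting delivery; the host, addresses,
and message are placeholders, not anything from this thread:

  import smtplib

  MESSAGE = b"..."  # placeholder for the fully encoded message
  with smtplib.SMTP("mail.example.org") as server:   # placeholder host
      server.ehlo()
      if server.has_extn("size"):
          limit = int(server.esmtp_features["size"] or 0)
          if limit and len(MESSAGE) > limit:
              # Too big for this receiver: compress, fragment, or bounce.
              raise RuntimeError("message exceeds the receiver's SIZE limit")
      server.sendmail("from@example.org", ["to@example.net"], MESSAGE)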

Even if you're transferring, say, an MPEG video stream, there are key
frames which allow small random corruptions to be ignored.  But compress
the video, and you can throw it away if there's a corruption.

MPEG *IS* compressed!  Loss of a P, B, or I frame will be a (temporary)
problem. There are techniques which can be used to detect and correct
some errors (at the cost of additional overhead), and there are other
video-specific techniques which can be used to conceal detected but
uncorrectable errors.  A general compression scheme, whether in a
MIME CTE or in TLS, is not a particularly good choice for compression
of audio, video, or image data; these media types have specific
compression methods tailored to the media characteristics.  For example,
a particular JPEG image I have is 4088362 octets; gzip yields
4083504 octets, which is negligible additional compression [*].  Now one
might be able to achieve some compression of particular subtypes of
some of these media types (e.g. a bitmap PCM image format), but the
resulting size will typically be no smaller than what could be achieved
with a lossless variant of media-specific compression (JPEG in the
case of image media).  General-purpose compression may however be useful
for text, text-like content (scripting languages, etc.) and some
application data.
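
A quick way to reproduce that kind of measurement on any file you care to
test (the file path is whatever you supply; nothing here is specific to
the JPEG mentioned above):

  import gzip, sys

  # Already-compressed media (JPEG, MPEG, MP3) typically shrinks very
  # little under gzip; text-like data often shrinks dramatically.
  path = sys.argv[1]
  data = open(path, "rb").read()
  packed = gzip.compress(data, compresslevel=9)
  print(f"{path}: {len(data)} -> {len(packed)} octets "
        f"({100.0 * len(packed) / len(data):.1f}% of original)")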

Conclusions:
1. binary-to-8bit encoding w/o gzip compression is fine for audio,
   image, and video media which already has media-specific compression
   (for 8bit transport, avoiding the 37+% expansion of base64)
2. gzip compression seems to be suitable for only a limited set of media
   types, therefore:
   a. plan on something that is amenable to multiple compression schemes,
      each alone (binary output) and coupled with encodings to 8bit and
      7bit.
   or
   b. choose a compression framework that is adaptable, pluggable, or
      configurable with parameters for different media characteristics
   or
   c. define suitable text, model, and application media subtypes with
      built-in compression, analogous to compression already available
      in audio, image, and video media, obviating the need for compression
      in CTE
   or
   d. take the up-front pain of adding a handful of new CTEs now (with
      and w/o compression) and live with the one compression method
which pretty much leads back to my earlier observations about proliferation
of encoding tags vs. some means of specifying compression orthogonally to
encoding per se.  To which I would add that given the end-to-end nature
of MIME CTE, even a brute-force adaptive scheme (try N different compression
methods, pick the best and tag the data for decompression) is reasonable
(it would be unreasonable to do that on a hop-by-hop transport-level basis).
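
A minimal sketch of that brute-force pick-the-best idea, using Python's
stock compressors (the labels are illustrative only, not proposed CTE
names):

  import bz2, gzip, lzma

  def best_compression(data):
      # Try several general-purpose compressors and keep whichever wins;
      # fall back to identity if nothing actually shrinks the data.
      candidates = {
          "gzip": gzip.compress(data),
          "bzip2": bz2.compress(data),
          "xz": lzma.compress(data),
          "identity": data,
      }
      label, packed = min(candidates.items(), key=lambda kv: len(kv[1]))
      return label, packed  # the label tags the data for decompression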
 
For example, there's the scenario of sending a message containing
a huge binary attachment.

For example, a spreadsheet in application/vnd.ms-excel format. An order
of magnitude size reduction is typically possible via compression. I
have just such a spreadsheet; it is 428544 octets in native form and
69674 octets in OpenOffice.org (compressed) .sxc format.

That's a bad example. The Excel format is a binary serialized object
format, unlike the XML-based OpenOffice.org format, so you've done a lossy
conversion.

No, the information is all there (both formats can be opened in OpenOffice
and contain the same spreadsheet data).

Compress the original spreadsheet, since that's what the CTE would do.

Whatever. Gzipping the original yields 167572 octets. So compare:

428544 -> base64 -> 586429
428544 -> gzip-8bit -> 171266 (assuming about 2% increase for escapes and
1000/998 for CRLF line endings)

That's about a 3.4:1 size ratio between base64 and gzip-8bit.  Quite a
size difference, though (unsurprisingly) not as much of a difference as
using an optimized representation that works with compression (sxc format).
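
For anyone who wants to check the arithmetic behind those two figures
(the 2% escape overhead for a hypothetical gzip-8bit CTE is the same
assumption as above):

  import math

  raw, gz = 428544, 167572     # original and gzipped sizes, octets

  # base64: 4 output octets per 3 input, plus CRLF every 76 characters
  b64_chars = math.ceil(raw / 3) * 4
  b64_total = b64_chars + 2 * math.ceil(b64_chars / 76)

  # gzip-8bit: ~2% expansion for escapes, 1000/998 for CRLF line endings
  gzip_8bit = round(gz * 1.02 * 1000 / 998)

  print(b64_total, gzip_8bit, round(b64_total / gzip_8bit, 1))
  # -> 586430 171266 3.4 (one octet off the 586429 quoted above, depending
  #    on whether the final base64 line ends with a CRLF)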

For example, compressed filesystems solve this same storage space
issue more efficiently at the system level,

It's not a storage space issue, it's an end-to-end issue.

The bandwidth issue can be addressed transparently with TLS/SSL, or
Apache/mod_gzip, etc. for any number of current applications.

TLS isn't end-to-end, and that does raise storage (and CPU) issues.
Moreover, it doesn't address some of the application-level issues (see
below).
 
Of course, all these transparent ways must be implemented and deployed
just the same, but what is the serious long-term benefit in making MIME
carry CTEs? Every MIME-aware application will have to be adapted to
read those new encodings.

See Harald's message.

Look at the costs associated with your alternative for a 3-hop
message transfer using TLS and filesystem compression:
1. compress in TLS from source on first hop, decompress at receiver
2. compress again in filesystem for local storage, decompress on read
3. compress in TLS from 1st to second hop, decompress at receiver
4. compress yet again in filesystem for local storage, decompress on read
5. compress again in TLS for 3rd hop, decompress at receiver
and that still doesn't address SMTP size limits, since those apply
between TLS decompression and filesystem compression.  That's a lot of
repetitive compression and decompression, a lot of software at several
sites that needs to be updated, and it still doesn't address an important
issue.  Nor is it adaptive to the needs of different media types.

Compare that to an end-to-end solution, where the issues are addressed and
only the endpoints need be concerned with compression and decompression.

-----
* similar for video: a 5394004 octet MPEG file is "compressed" by gzip
  to 5260869 octets.  MP3 audio is left as an exercise for the reader.
  Incidentally, if one tries to compress that sxc spreadsheet with gzip,
  one goes from 69674 octets to 68051 octets.