ietf-822

Re: gzip/deflate compression/encoding

2005-06-28 04:02:26

On Mon June 27 2005 21:00, Laird Breyer wrote:

On Jun 27 2005, Bruce Lilly wrote:

That's correct, but surely it's the way it should be: if some network
admin wants to allow faster transfers to his users, then he'll set up
the software.

1. The bottleneck might be elsewhere, known to the originator, but
   unknown to the network operator
2. That still implies RFC 3749 support at the operator's side
3. It assumes that the end user knows how to, can, and will install
   TLS support including the RFC 3749 extension
 
Such compression cannot work around e.g. SMTP message size limits as
the message would be stored in uncompressed form.

Which SMTP size limits are those?

See RFC 1123 section 5.3.8.

I'm not convinced that it is the job of the message format to make 
storage easier or smaller.

We're talking about MIME, not the Message Format.  MIME does a number
of things to work around transport issues.  A side-effect of some of those
(notably the two non-identity encodings) is a substantial increase in
message size.  Additional encodings that do not substantially increase
message size would be welcome.  If they can decrease message size, so
much the better.

In fact, the best compression can only be achieved 
by the final destination system, because it can combine the characteristics
of all received messages in optimal ways. 

You are assuming that the sole criterion is storage size on the ultimate
destination system, which is not a criterion at all in the current
discussion.

Indicating the fact of compression and the method by...? (w/o obscuring
the nature of the media type that is compressed)

What's wrong with an application/gzip (or similar) attachment and a 
Content-Description field which tells the user what's inside?

It obscures the nature of the media type that is compressed.  From that
there is no machine-readable way to determine whether it's text, image,
audio, video, or model data.  It's claimed to be application data.  If
the data is in fact a composite media type (message or multipart), your
scheme won't work at all within the MIME framework.  (Of course, a
transfer encoding can't be applied to a composite entity as a whole
either, but it can be applied to its individual parts.)
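
To illustrate the objection, here is a minimal sketch using Python's
standard email package.  The CSV payload and the filename report.csv.gz
are hypothetical; the point is only that once the data is labelled
application/gzip, the MIME headers no longer say that the content is
text.

```python
import gzip
from email.message import EmailMessage

# Hypothetical payload: plain text (CSV) that we compress before attaching.
text = "col1,col2\n1,2\n" * 100

msg = EmailMessage()
msg["Subject"] = "report"
msg.set_content("See attachment.")

# Attach the gzip stream as application/gzip; base64 is the default
# transfer encoding for binary attachments in this API.
msg.add_attachment(
    gzip.compress(text.encode("ascii")),
    maintype="application",
    subtype="gzip",
    filename="report.csv.gz",
)

part = list(msg.iter_attachments())[0]

# The headers describe the container, not the text inside it.
print(part.get_content_type())
print(part["Content-Transfer-Encoding"])
```

A receiving agent sees only the application top-level type; it has to
decompress the body and guess (for instance from the filename) to learn
that the content was text.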

The Content-Description field is not machine-readable; it is a
human-readable unstructured field.  Suppose the content is being sent
to a mailing list which is expanded to a hundred recipients, each of whom
speaks a different language from the others.  What would you put in a
Content-Description field such that it wouldn't be larger than any other
message component and wouldn't imply some sort of favoritism?
 
There are two distinct issues at work:
1. encoding binary data to fit into 8bit (as opposed to 7bit) transport
   can be done in a more space-efficient manner than is possible with
   binary-to-7bit encoding
2. compression for size reduction of stored/transmitted content

I'd like to understand these points better. What kind of space saving
scenario do you have in mind?

For binary data, base64 encoding uses a 3:4 encoding with insertion of
a CRLF pair after at most every 76 octets of output.  That's an expansion
factor of (4/3) * (78/76) = 26/19, or about a 37% (minimum) increase in
size.  For binary data, quoted-printable typically expands data by a much
greater factor; quoted-printable should probably be viewed as an ISO-8859
text 8bit-to-7bit encoding.
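
The 26/19 factor can be checked with a short sketch.  The helper name
mime_base64 and the test payload are illustrative assumptions; the
payload length is chosen as a multiple of 57 octets so that every output
line is exactly 76 characters plus CRLF.

```python
import base64
from fractions import Fraction


def mime_base64(data: bytes, line_len: int = 76) -> bytes:
    """Base64-encode data, ending each line of at most line_len chars with CRLF."""
    raw = base64.b64encode(data)
    lines = [raw[i:i + line_len] for i in range(0, len(raw), line_len)]
    return b"".join(line + b"\r\n" for line in lines)


# 57 input octets produce one full 76-character output line.
payload = bytes(range(256)) * 57
encoded = mime_base64(payload)

# Exact expansion factor as a reduced fraction, and as a float.
ratio = Fraction(len(encoded), len(payload))
print(ratio, float(ratio))
```

For inputs that are not a multiple of 57 octets the factor is slightly
larger because of padding and the short final line, which is why the
figure is a minimum.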
 
For example, there's the scenario of sending a message containing
a huge binary attachment.

For example a spreadsheet in application/vnd.ms-excel format. An order
of magnitude size reduction is typically possible via compression. I
have just such a spreadsheet; it is 428544 octets in native form and
69674 octets in OpenOffice.org (compressed) .sxc format.
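
For the two sizes quoted above (a reduction of roughly 6:1), the effect
on the transmitted size can be estimated with a small sketch.  The
function name base64_wire_size is hypothetical, and the calculation
counts only the base64 body with CRLF line endings, not headers or
multipart boundaries.

```python
def base64_wire_size(n: int, line_len: int = 76) -> int:
    """Octets needed to carry n binary octets as CRLF-terminated base64 lines."""
    chars = 4 * ((n + 2) // 3)                  # 3 octets in, 4 characters out
    lines = (chars + line_len - 1) // line_len  # lines of at most line_len chars
    return chars + 2 * lines                    # plus one CRLF per line


native = 428544      # spreadsheet in native form (figure from the message)
compressed = 69674   # same spreadsheet in compressed .sxc form

for label, size in (("native", native), ("compressed", compressed)):
    print(label, size, base64_wire_size(size))
```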

Arguably in that case, the sender is misusing 
email as a file transfer facility, and should only send instructions on
how to retrieve the file separately.

You are assuming that:
1. the sender has a way to store the file such that it can be retrieved
   by others w/o corruption
2. bandwidth from the sender to the hypothetical storage place
   isn't an issue
3. size isn't an issue at the hypothetical storage place
4. all recipients are able to retrieve files by the same hypothetical
   method
5. bandwidth at each recipient's side is not an issue
6. such an arrangement is agreeable to the sender and all recipients
 
Another scenario is the case of an SMTP agent being overwhelmed by
large numbers of messages, such as spam.

Not a criterion.