ietf-822

Re: A bad journey (an apocryphal war story)

1992-02-12 15:14:33
> Do I understand correctly that the right way to do the transport encoding
> depends on the type of the object transmitted? This is contrary to the
> credo I heard before: transport encoding can be done/undone without knowing
> anything about the type of the transported object.

I don't know where you got this idea, but it is definitely not correct. The
choice of encoding is entirely independent of the object's type. This is, I
believe, explicitly spelled out in MIME.

It may make sense to pick an encoding based on some knowledge of the object
type, but this is definitely not required. I personally prefer to pick the
encoding that produces the smaller resulting representation. I base this
on a partial analysis and estimate of the results of encoding.
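One way to sketch that kind of size estimate (the specific heuristic below is my own illustration, not something from this message) is to count how many bytes would survive quoted-printable unescaped, and compare against base64's fixed 4-bytes-per-3 expansion:

```python
def estimate_sizes(data: bytes) -> tuple[int, int]:
    """Rough size estimates for quoted-printable vs. base64 output.

    Heuristic only: printable ASCII other than '=' (plus space and tab)
    costs 1 byte in quoted-printable, everything else costs 3 ('=XX');
    base64 always costs 4 output bytes per 3 input bytes.  Soft line
    breaks and padding details are ignored.
    """
    safe = sum(1 for b in data if (33 <= b <= 126 and b != 61) or b in (32, 9))
    qp_estimate = safe + 3 * (len(data) - safe)
    b64_estimate = 4 * ((len(data) + 2) // 3)
    return qp_estimate, b64_estimate

text = b"Mostly plain ASCII text encodes compactly in quoted-printable."
blob = bytes(range(256))
# For the text, the quoted-printable estimate comes out smaller;
# for the binary blob, the base64 estimate wins.
```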

> Anyway, if doing the right thing is possible, but 'hard', most implementations
> will do it wrong (there are enough simple things done wrong to prove this
> point). A specification that leads to such a conclusion should be considered
> broken in the first place.

You are confusing the notions of encoding and representation. Regardless of
whether or not you couple encoding to object types in any way, encoding and
representation remain entirely distinct.

For each type of object there is a canonical representation format. The
representation for text, for example, is explicitly spelled out in
RFC821/RFC822. This representation is lines separated by CR-LF sequences.
If you have material that is not separated into lines in this way, it is not
text and the encoding you use does not matter at all.

Consider application/postscript. PostScript has a representation format that is
explicitly defined in the PostScript language reference manual. There are no
line breaks in it per se; there is only white space between operators, and this
can optionally be a CR, LF, or CR LF. Thus you can have PostScript that looks
like text and meets the criteria of 7bit or 8bit encoding, but you may have
PostScript that does not look even vaguely like this and falls far short of the
encoding criteria. But in either case the encoding has nothing to do with it.

Given an actual object in its representation format you can then sit down
and figure out what encodings are suitable for it. 7bit and 8bit are suitable
only for objects whose representation format consists of sequences of
bytes where fewer than 1000 characters go by without a CR LF in there
somewhere. There are also restrictions on the bytes that can appear in 7bit,
of course.
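A rough check of those criteria might look like this. This is a sketch of the limits as described above; the full 7bit restrictions in MIME also exclude NULs and bare CRs and LFs, which I have folded in only partially:

```python
def fits_7bit(data: bytes) -> bool:
    """Check whether an object, in its representation format, could be
    sent as 7bit: every byte must fit in 7 bits, no NULs, and fewer
    than 1000 characters may go by without a CR LF somewhere."""
    if any(b > 127 or b == 0 for b in data):
        return False
    # Split on CR LF; each resulting run must stay under 1000 bytes.
    return all(len(line) < 1000 for line in data.split(b"\r\n"))
```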

Quoted-printable and base64 are both equally capable of representing anything
at all. Quoted-printable looks best when it is applied to the canonical
representation of text, of course, but this does not mean that it is only
suitable for this usage. There are plenty of purely binary objects that will
encode more easily and transfer more quickly in quoted-printable. It all
depends on the frequency count of the various byte values in the data.
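As an illustration of that frequency-count effect (using Python's quopri and base64 modules as stand-ins for any conforming encoder; the sample data is mine), encoding the same material both ways shows which encoding comes out smaller:

```python
import base64
import quopri

ascii_text = b"Plain ASCII text encodes almost unchanged in quoted-printable.\n"
binary = bytes(range(256))  # every byte value once: hostile to quoted-printable

qp_text = quopri.encodestring(ascii_text)   # mostly passes through literally
b64_text = base64.encodebytes(ascii_text)   # inflates by a third regardless

qp_bin = quopri.encodestring(binary)        # most bytes become 3-byte '=XX'
b64_bin = base64.encodebytes(binary)        # still just 4 bytes per 3 input

# The text is smaller in quoted-printable; the binary is smaller in base64.
```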

The only thing out of all of this that is not explicit in MIME, as far as I
can tell, is the fact that MIME implicitly extends the notion of text to
longer lines through the use of encodings. We probably should emphasize this
more. But the canonical representation of text remains intact largely because
it is defined in other documents and MIME does not change it in any way.

For a while we had the notion that it would be a good idea to couple the
encoding to the representation. The idea was to make the line break in
quoted-printable match up to whatever concept was convenient in the
representation format. For example, in a format that represents records in some
way a line break could then be used as a record boundary. We even had an
equivalent concept in base64, where we had defined a record delimiter
mechanism. But this gets very messy in a whole bunch of cases; the one that
immediately comes to mind is checksums. Applying a checksum after encoding
gives you a different checksum depending on the encoding. This is not good.
Applying the checksum to the data stream before encoding only works as long
as there is no out-of-band information in the encoding. If you have information
in the encoding that's not part of the data stream it will not be checksummed,
and a separate checksum would be needed to handle out-of-band information.
The idea of having two checksums is also not good.
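The ordering problem is easy to see with any concrete checksum; CRC-32 below is just a stand-in of my choosing, not anything the working group specified:

```python
import base64
import zlib

data = b"some payload\r\nwith two lines\r\n"

crc_before = zlib.crc32(data)     # checksum of the data itself
encoded = base64.encodebytes(data)
crc_after = zlib.crc32(encoded)   # checksum of the encoded form: a
                                  # different value, and it would change
                                  # again under quoted-printable

# Checksumming before encoding survives any encode/decode round trip,
# as long as the encoding carries no out-of-band information:
decoded = base64.decodebytes(encoded)
```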

As I recall, the problems with checksumming led us into a discussion of the
whole idea of tying the representation to the encoding. This leads to trouble
when converting from 7bit or 8bit to base64. Do you use record delimiters or
embedded CR-LFs? If someone converts base64 to quoted-printable, what
do CR-LFs in the base64 representation map to? These questions can be
answered consistently, but the result is not appealing. This led to the
decision to get rid of out-of-band information in all encodings (a wise
choice, in my opinion) and to totally decouple representation from encoding
(also a wise choice, in my opinion).

This is how it went to the best of my knowledge/memory. People seemed to be
very enthusiastic about the result. Since any 822 compliant implementation
today has to deal with CR-LF delimiters in text, there is no reason to
suppose that this is any harder than other, system-specific formats. UNIX
likes LF, Macs like CR, PCs like CR-LFs, and VMS can handle any or all of
them simultaneously without any problems at all. But the Internet chose
CR-LF some time ago as the delimiter in the canonical representation of text,
and we did not change that in any way in MIME.
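A minimal sketch (my illustration) of bringing those system-specific forms to the canonical CR-LF representation:

```python
def to_canonical(text: bytes) -> bytes:
    """Normalize UNIX (LF), Mac (CR), PC (CR-LF), or mixed line
    endings to the canonical CR LF form used on the wire."""
    # Collapse everything to bare LF first, then expand to CR LF,
    # so existing CR-LF pairs are not doubled.
    interim = text.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return interim.replace(b"\n", b"\r\n")
```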

                                        Ned
