Re: On encodings : random thoughts....

Alain FONTAINE writes:

2- The two main transfer-encodings proposed are of a very different
nature. BASE64 is a scheme where the octet stream is transfer-encoded
into a string of printable characters. One is of course aware that those
characters will themselves be natural-encoded as a binary stream for
transmission, but the whole scheme is designed so that the values used
for transmission don't play any role at all. So a BASE64
transfer-encoding prepared on a machine based on one natural-code will
decode properly on a machine based on another natural-code if a 'normal'
transcoding has been performed (BASE64 is of course one million or more
times better than uuencode since a- it is documented outside source code
b- the characters selected for the transfer-encoded representation have
all chances to be correctly transcoded in any usable mail gateway). To
state it otherwise : each machine only has to be able to recognize the
64 characters while natural-encoded in the local code. With some care
(using character constants, etc), it is possible to write a decoding
program that will work on machines based on different natural-codes, by
simple recompilation (of the transcoded source, of course..).
QUOTED-PRINTABLE is quite a different animal : there is no separation
between the encoded and the encoding values. Entering into the details
of what can happen would certainly bore everyone to death, but I am
ready to do the exercise if anyone does care to read it. The final
conclusion is that, for faithfully decoding QUOTED-PRINTABLE, the
decoder should a- know the natural-code of the machine it is running on
b-know the natural-code used by the transfer-encoder to write the
transfer-encoded message (how) c-perform a reverse transcoding to
recover the transfer-encoded message as written by the transfer-encoder
d-perform the transfer-decoding and e-if the object is readable text,
transcode again into the local natural-code to make readable. Of course,
all this will probably fail anyway if the mail has gone through a
gateway between different natural-codes, for the same reasons that make
uuencode fail in the same circumstances (anyone pretending the contrary
does not work in The Real World TM). It seems that QUOTED-PRINTABLE also
does not protect trailing blanks from the voracious appetite of some
gateways...
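
To illustrate the "character constants" point about BASE64 above, here is a
minimal sketch in Python. It assumes the 64-character alphabet A-Z, a-z, 0-9,
'+' and '/' with '=' as padding; the names ALPHABET, VALUE_OF and
decode_base64_chars are made up for this example. The decoder only ever asks
"which of the 64 characters is this?" and never looks at the numeric code the
local machine uses for that character.

```python
# The alphabet is written as character constants, so the table below is
# built from characters rather than from any particular numeric code.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
VALUE_OF = {ch: i for i, ch in enumerate(ALPHABET)}


def decode_base64_chars(text: str) -> bytes:
    """Decode BASE64 text by character lookup (illustrative sketch only)."""
    out = bytearray()
    acc = 0      # bits collected so far
    nbits = 0    # how many bits of acc are still pending
    for ch in text:
        if ch == "=":
            break            # padding marks the end of the data
        if ch not in VALUE_OF:
            continue         # skip line breaks and other foreign characters
        acc = (acc << 6) | VALUE_OF[ch]
        nbits += 6
        if nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1   # keep only the leftover bits
    return bytes(out)


print(decode_base64_chars("TWFu"))
```
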



Decoding quoted-printable seems a bit easier than that:
- Unquoted single characters (like 'a') are decoded into the bit pattern used
  to represent that character in MAILASCII.  (Admittedly this is difficult
  if that bit pattern isn't recognized as being one of those characters on
  your machine -- e.g. if you are an EBCDIC machine, and you see the character
  x'AD' produced by some ASCII->EBCDIC translator, but your local variant
  of EBCDIC doesn't use x'AD' to represent left bracket.)
- A character preceded by an ampersand decodes to the bit pattern that
  represents that second character in MAILASCII, plus 80 hex.
- Hex constants preceded by a backslash define a particular bit pattern
  that can be decoded anywhere.  (except that backslash isn't in some
  EBCDIC variants...oops!)
- ends-of-lines preceded by a backslash are ignored
- unescaped ends-of-lines within encoded data are interpreted according
  to some convention which needs to be defined.
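
The decoding rules above can be sketched roughly as follows, in Python. This
is only an illustration under some assumptions of mine: the function name
decode_draft_qp is invented, Python's ord() stands in for the "bit pattern in
MAILASCII" lookup (which is exactly the part that is hard on a non-ASCII
machine), and an unescaped end-of-line is simply turned into a caller-supplied
byte string, since that convention is still undefined.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def decode_draft_qp(text: str, eol: bytes = b"\n") -> bytes:
    """Decode the ampersand/backslash scheme described above (sketch)."""
    out = bytearray()
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == "\\":
            if text.startswith("\r\n", i + 1):
                i += 3                      # escaped end-of-line: ignored
            elif text.startswith("\n", i + 1):
                i += 2                      # escaped end-of-line: ignored
            else:
                pair = text[i + 1:i + 3]
                if len(pair) != 2 or not set(pair) <= HEX_DIGITS:
                    raise ValueError("backslash needs two hex digits")
                out.append(int(pair, 16))   # hex constant
                i += 3
        elif ch == "&":
            nxt = text[i + 1:i + 2]
            if not nxt or ord(nxt) >= 0x80:
                raise ValueError("ampersand needs an ASCII character")
            out.append(ord(nxt) + 0x80)     # second character plus 80 hex
            i += 2
        elif text.startswith("\r\n", i):
            out += eol                      # assumed end-of-line convention
            i += 2
        elif ch == "\n":
            out += eol                      # assumed end-of-line convention
            i += 1
        else:
            if ord(ch) >= 0x80:
                raise ValueError("encoded text must be ASCII")
            out.append(ord(ch))             # unquoted single character
            i += 1
    return bytes(out)


print(decode_draft_qp("a&a\\0A").hex())
```
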

The resulting octet string is fed to the interpreter for whatever content-type
applies to this particular body part.  This interpreter does, of course, have
to know about the particular hardware being used to display that body part,
including how to represent the various characters being displayed.

Encoding is also easy to describe:

First you have to convert the byte stream from its local representation to the
representation defined for that content type.  This might, for instance, mean
converting from the particular IBM code page used on your local system to
ISO 8859/1 (or whatever the content-type in use calls for).  Then you encode
the resulting octet-string as follows:

- if the byte to be encoded corresponds to one of a reasonable subset of the
  printable MAILASCII characters, encode the byte as that character.
- if the byte to be encoded, less 80 hex, corresponds to one of the same subset
  of printable MAILASCII characters, encode the byte using ampersand and that
  character.
- otherwise, encode the byte using a backslash and two hex characters.
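
A rough Python sketch of those three encoding rules, for illustration only.
The name encode_draft_qp and the particular DEFAULT_SAFE set are my own
choices, not part of any proposal; the "reasonable subset" is passed in as a
parameter so it can be made larger or smaller. Ampersand and backslash are
deliberately left out of the subset so that they always get hex-encoded. Line
breaking and trailing blanks are not handled here.

```python
import string

# One possible "reasonable subset": letters, digits and a little punctuation.
# '&' and '\' must never be in the subset, since they introduce escapes.
DEFAULT_SAFE = frozenset(
    (string.ascii_letters + string.digits + " .,;:!?'()-").encode("ascii")
)


def encode_draft_qp(data: bytes, safe: frozenset = DEFAULT_SAFE) -> str:
    """Encode octets with the ampersand/backslash scheme (sketch)."""
    out = []
    for b in data:
        if b in safe:
            out.append(chr(b))                 # printable: as itself
        elif b >= 0x80 and (b - 0x80) in safe:
            out.append("&" + chr(b - 0x80))    # high bit set: ampersand form
        else:
            out.append("\\%02X" % b)           # anything else: hex constant
    return "".join(out)


print(encode_draft_qp(bytes([0x61, 0xE1, 0x0A])))
```

Passing a smaller set for safe (digits and letters only, say) pushes more
bytes into the hex form, which trades readability for a better chance of
surviving gateways.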

"reasonable subset" depends on what you are trying to encode.  If you are
encoding text to be read by humans and the emphasis is on readability, you
probably want to use most of the set of printable ASCII characters, on the
assumption that most of the readers will have ASCII terminals.  On the other
hand, if you are encoding data which needs to be kept intact end-to-end to be
fully utilized, but which is still useful if read by humans, then you probably
want to use a smaller set of printable characters, expressing the "weird" ones
with hex constants.

Note that the encoded form always uses some subset of the characters in ASCII. 
If the encoding system cannot represent the ASCII character that corresponds to
a particular bit pattern, it can use the hex encoding instead.  Decoding on
non-ASCII systems is a little harder, but if the message is just being read by
humans it will still be mostly-readable.  Since the message encoded thusly is
intended for an ASCII system, it probably isn't usable (except to be read by
humans) on an EBCDIC system anyway.

Keith