On encodings : random thoughts....

Having followed this discussion since its beginning, I am trying to see
as clearly as possible through the mud... Encodings play a vital role in
the whole RFC-XXXX scheme, and this message only reflects some internal
discussions I have with myself while trying to understand what happens.
This is neither a tutorial (I am not a professor), nor a set of
proposals. Just consider it as some easily inflammable material
submitted to a gang of professional flamers 8-)...

1- The only kind of encoding RFC-XXXX speaks about is transfer-encoding,
which is needed to extend the capabilities of the transfer
architecture. The transfer-encodings proposed are in fact simply
extending the transfer architecture to make it able to transfer
objects that are 'octet streams'. So it seems to me that one important
unsaid hypothesis is that any 'end-user visible' object is amenable to
that peculiar form. This of course means that another level of encoding
has been performed on the 'end-user visible' objects. As far as I
understand what's in RFC-XXXX, the Content-type header field does in
some cases clearly identify both the true type of the object and its
encoding into an octet stream. 'G3FAX', for example, indicates that the
object is a matrix of dots, encoded according to some well-known rules to
become an octet stream. In some other cases, things are unfortunately
not so clear. For texts, I would tend to consider that the true type of
the object is a combination of the fact of being a text (lines, etc) and
of the repertoire of characters it contains ; then this type is encoded
according to some table listing octet (or multi-octet) values for each
character in the repertoire, or even by methods involving states and
table switching commands. For example, the same 'end user visible' text
containing characters in the repertoire 'latin-1' can be encoded in lots
of different ways to become an octet string (ISO8859-1, 9 'EBCDIC'
CECP's, ISO10646, UNICODE, and probably some variants using ISO2022
immediately come to mind).
To summarize, there is an application-encoding (for example dot-matrix
to octets in G3FAX), a transfer-encoding (octet stream to mail encoded
form) and the natural-encoding of the underlying machine, used without
even noting its presence, but important nonetheless : if the
transfer-encoding rules say "emit an 'A'", an ASCII machine will write
x'41' while an EBCDIC machine will write x'C1'... The fact that in many
cases the first and the third encodings are the same, or nearly so, is a
potent contributor to the mud surrounding the subject...
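
To make the first and third levels a bit more tangible, here is a small
Python sketch. It is only an illustration, not something taken from
RFC-XXXX : the codec names are Python's own, 'cp037' is just one EBCDIC
code page picked as an example, and 'utf-16-be' stands in for UNICODE.

```python
# One 'end-user visible' text, several possible octet-stream encodings.
# Assumption: Python codec names; 'cp037' is one EBCDIC code page among many.
text = "déjà vu"  # every character belongs to the Latin-1 repertoire

encodings = ["latin-1", "cp037", "utf-16-be"]
octets = {name: text.encode(name) for name in encodings}
for name, data in octets.items():
    # Same text, different octets depending on the table used.
    print(f"{name:10} {data.hex(' ')}")

# The natural-encoding of the machine: the rule "emit an 'A'" produces a
# different octet on an ASCII machine and on an EBCDIC machine.
ascii_a = "A".encode("ascii")
ebcdic_a = "A".encode("cp037")
print(ascii_a.hex(), ebcdic_a.hex())  # 41 c1
```

Each of the octet strings decodes back to the same text, provided the
reader knows which table was used to produce it.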

2- The two main transfer-encodings proposed are of a very different
nature. BASE64 is a scheme where the octet stream is transfer-encoded
into a string of printable characters. One is of course aware that those
characters will themselves be natural-encoded as a binary stream for
transmission, but the whole scheme is designed so that the values used
for transmission do not play any role at all. So a BASE64
transfer-encoding prepared on a machine based on one natural-code will
decode properly on a machine based on another natural-code if a 'normal'
transcoding has been performed (BASE64 is of course one million or more
times better than uuencode since a- it is documented outside source code
b- the characters selected for the transfer-encoded representation have
every chance of being correctly transcoded by any usable mail gateway). To
state it otherwise : each machine only has to be able to recognize the
64 characters while natural-encoded in the local code. With some care
(using character constants, etc), it is possible to write a decoding
program that will work on machines based on different natural-codes, by
simple recompilation (of the transcoded source, of course..).
QUOTED-PRINTABLE is quite a different animal : there is no separation
between the encoded and the encoding values. Entering into the details
of what can happen would certainly bore everyone to death, but I am
ready to do the exercise if anyone does care to read it. The final
conclusion is that, to decode QUOTED-PRINTABLE faithfully, the
decoder should
  a- know the natural-code of the machine it is running on,
  b- know the natural-code used by the transfer-encoder to write the
     transfer-encoded message (but how ?),
  c- perform a reverse transcoding to recover the transfer-encoded
     message as written by the transfer-encoder,
  d- perform the transfer-decoding, and
  e- if the object is readable text, transcode again into the local
     natural-code to make it readable.
Of course,
all this will probably fail anyway if the mail has gone through a
gateway between different natural-codes, for the same reason that makes
uuencode fail in the same circumstances (anyone pretending the contrary
does not work in The Real World TM). It seems that QUOTED-PRINTABLE also
does not protect trailing blanks from the voracious appetite of some
gateways...
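
The difference between the two transfer-encodings can be sketched in a
few lines of Python. Everything here is an assumption made for the sake
of the example : the gateway is simulated by a plain character
transcoding between 'ascii' and 'cp037' (one EBCDIC code page), the
helper names are hypothetical, and the naive decoder ignores soft line
breaks and malformed input.

```python
import base64
import quopri

def gateway(wire: bytes, src: str, dst: str) -> bytes:
    """Simulate a gateway transcoding characters between two natural-codes."""
    return wire.decode(src).encode(dst)

def naive_qp_decode(wire: bytes, natural: str) -> bytes:
    """QUOTED-PRINTABLE decoding done purely in the local natural-code:
    '=XX' yields octet XX, any other character keeps its local octet value."""
    chars = wire.decode(natural)
    out = bytearray()
    i = 0
    while i < len(chars):
        if chars[i] == "=" and i + 2 < len(chars):
            out.append(int(chars[i + 1:i + 3], 16))
            i += 3
        else:
            out += chars[i].encode(natural)
            i += 1
    return bytes(out)

payload = "café".encode("latin-1")  # the octet stream: 63 61 66 e9

# BASE64: only the characters matter, never their octet values.
b64_ascii = base64.b64encode(payload)              # written on an ASCII machine
b64_ebcdic = gateway(b64_ascii, "ascii", "cp037")  # as seen on an EBCDIC machine
b64_result = base64.b64decode(b64_ebcdic.decode("cp037"))

# QUOTED-PRINTABLE: literal characters and encoded octets are mixed.
qp_ascii = b"caf=E9"                               # QP form of the payload
qp_ebcdic = gateway(qp_ascii, "ascii", "cp037")
naive = naive_qp_decode(qp_ebcdic, "cp037")        # EBCDIC 'caf' + octet e9

# Steps c-, d- and e- from the text, with a- and b- supplied by hand.
restored = gateway(qp_ebcdic, "cp037", "ascii")    # c- reverse transcoding
octets = quopri.decodestring(restored)             # d- transfer-decoding
local = octets.decode("latin-1").encode("cp037")   # e- into the local code

print(b64_result == payload)   # BASE64 survived the gateway
print(naive.decode("cp037"))   # the naive QP result, read locally
print(local.decode("cp037"))   # the faithful QP result, read locally
```

In this toy case the naive decoder shows 'cafZ' on the EBCDIC side,
because octet x'E9' is a 'Z' there, while the five-step route gives back
the original text. Note that step b- had to be hard-coded : nothing in
the message itself says which natural-code the encoder was using.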

3- Transfer-encodings cost : processor time to perform them, and network
bandwidth since they nearly always make the object size grow. It seems
reasonable to try to avoid applying a transfer-encoding more than once,
since its purpose is to enhance the transport architecture. The ideal
situation would then be to encode the message just before, uh, let's say
'just before starting to transport it' (I don't want to get caught in
the UA/MTA philosophical discussion), but always with 'UA authority'.
The only place where multiple transfer-encodings are a real threat is
the multipart message, with its possibly recursive encapsulations. But
since the transfer-encoding is needed only to get around some
limitations of the transport architecture, I don't see any reason why
parts or recursively encapsulated things should already be
transfer-encoded before transfer-encoding the complete message. So a
complete multipart message could be in its 'octet streams (after the
application-encoding) preceded by headers and separated by separators'
form before a wholesale transfer-encoding. This means that all UAs should
be able to faithfully store an octet stream, even if they are unable to
interpret it in any way. Some (important) details would have to be
worked on, like the fact that wholesale transfer-encoding should leave
the structure alone, and not encode the part separators and the
encapsulated parts' headers, to keep the structure visible without
transfer-decoding the message. Also, if everyone feels that more than
one transfer-encoding is really needed, one could imagine ways to ensure
that each part gets the right encoding (and note that with this system,
it will get that encoding and that encoding only.... ).
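
A rough Python sketch of both points, the cost of encoding twice and
the wholesale encoding that leaves the structure alone. This is only
one way to picture the idea, not a proposal : the separator syntax, the
function name and the way parts are represented are all invented for
the example, and BASE64 is emitted without line wrapping to keep it
short.

```python
import base64

# Cost: BASE64 turns every 3 octets into 4 characters, so each extra
# layer of transfer-encoding multiplies the size again.
body = bytes(range(250)) * 12          # 3000 arbitrary octets
once = base64.b64encode(body)          # 4000 characters
twice = base64.b64encode(once)         # 5336 characters
print(len(body), len(once), len(twice))

def wholesale_encode(parts, depth=0):
    """Transfer-encode a (possibly nested) multipart message in one pass.

    parts is a list of (header lines, content) pairs; content is either an
    octet stream (bytes) or a nested list of parts. Separators and headers
    stay in clear, and each octet stream is encoded exactly once.
    The '--sep-N' separator is a hypothetical syntax for this sketch.
    """
    sep = b"--sep-" + str(depth).encode("ascii")
    lines = []
    for headers, content in parts:
        lines.append(sep)
        lines.extend(headers)
        lines.append(b"")
        if isinstance(content, list):
            # Nested multipart: recurse, do not encode the inner structure.
            lines.append(wholesale_encode(content, depth + 1))
        else:
            lines.append(base64.b64encode(content))
    lines.append(sep + b"--")
    return b"\n".join(lines)

fax = b"\x00\xff\x10"                  # stand-in for some G3FAX octets
message = wholesale_encode([
    ([b"Content-type: text"], b"hello"),
    ([b"Content-type: multipart"], [([b"Content-type: G3FAX"], fax)]),
])
print(message.decode("ascii"))
```

The separators and the headers of the encapsulated part remain readable
in the result, and the fax octets have been through BASE64 only once,
however deep the part is buried.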

*- This is becoming long, and probably contains more stupid things
written in bad English than one can reasonably be expected to stand. I
just want to thank those who have read up to this point, if any...
Flames are very welcome next week : I'll be in Blois (2nd European
Networking Conference) till Thursday 8-). But e-mail does pile up, so
I'll read answers starting next Friday.

Alain FONTAINE                       +--------------------------------+
Universite Catholique de Louvain     | If your mail software barks at |
Service d'Etudes Informatiques       | my address, you may try :      |
Batiment Pythagore                   |                                |
Place des Sciences, 4                | FNTA80(_at_)BUCLLN11(_dot_)BITNET |
B-1348 Louvain-la-Neuve, BELGIUM     +--------------------------------+
phone +32 (10) 47-2625