The KTH-proposal of a solution to the header-problem

After a long discussion we have finished this proposal
for a solution to the general non-ASCII-header-problem. The idea is
probably not new, but it describes how we really think the problem
might be solved.

We strongly recommend that you read this proposal and comment on it.

The header-problem we have been discussing for some days now is the
proposal to freeze RFC-XXXX in its current state, i.e. NOT to allow anything
but ASCII (0-127) in the headers (strictly continuing to follow RFC-822).

Today we already have several sendmail implementations on the market
that violate RFC-822, so we have to do something.

There is an IETF meeting in November where we hope RFC-XXXX in its
final shape can be discussed, and we would be very happy if something
like this were included in RFC-XXXX before that meeting.

                        With regards
                                        Patrik
---------------------------------------------------------------------

How to make it possible to use non-US-ASCII in the headers.

Our proposal is based upon the introduction of two new headers:

        Header-Transfer-Encoding
        Header-Type

These are defined analogously to the new RFC-XXXX headers:

        Content-Transfer-Encoding
        Content-Type

A) Header-Transfer-Encoding

Allowed Header-Transfer-Encoding types are:
        
   - Quoted-Printable

        It is ok to use this encoding, but only if the ':', which
        is a special character according to RFC-822, is changed to '='
        (a non-special character). We will not use the '&', because
        that one is used by Quoted-Readable. The result of the encoding
        process will then not introduce any special character. An atom
        is still an atom.

   - 8-bit
   - 7-bit
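
The Quoted-Printable variant described above can be sketched as follows.
This is only an illustration in Python, not part of the proposal: it
assumes '=' as the escape octet followed by two upper-case hex digits, and
the function names encode_octets and decode_octets are made up for the
example.

```python
ESCAPE = ord("=")               # escape octet suggested in the text (assumption: =XX form)
HEX_DIGITS = b"0123456789ABCDEF"

def encode_octets(raw: bytes) -> bytes:
    """Write the escape octet, control octets and non-ASCII octets as =XX."""
    out = bytearray()
    for octet in raw:
        if octet == ESCAPE or octet < 32 or octet > 126:
            out += b"=%02X" % octet
        else:
            out.append(octet)
    return bytes(out)

def decode_octets(encoded: bytes) -> bytes:
    """Reverse encode_octets; a stray '=' is copied through unchanged."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        pair = encoded[i + 1:i + 3]
        if (encoded[i] == ESCAPE and len(pair) == 2
                and all(c in HEX_DIGITS for c in pair)):
            out.append(int(pair, 16))
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)

name = "F\u00e4ltstr\u00f6m".encode("iso-8859-1")
print(encode_octets(name))      # b'F=E4ltstr=F6m'
```

The encoder only ever emits '=', hex digits and octets that were already
present, so it adds no RFC-822 special character of its own.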

The encoding can not be done with:

   - Base-64 and Binary

        These encode whole parts of a text, not just a fragment such as a comment.

   - X-atom

        We can see several problems if we allow user-defined encodings 
        of headers that might have to be parsed by an old MTA or UA.


B) Header-Type

Allowed header-types:

   - US-ASCII
   - ISO-8859-1,2,3,....
        
        All the above will be easy to use.

   - Quoted-Readable
   - ISO-2022
   - ISO-10646-AUC

        These will also work, but they lead to some quoting
        considerations, see below.

Only one will not be used:

   - ISO-10646

        Out of date. (We suppose ISO-10646-AUC will be used instead)

C) There are four different parts of the headers that we have to look at:

        1 - Headers with the RFC-822 syntax *text
        2 - The Phrase in the RFC-822 description
        3 - The Comment in the RFC-822 description
        4 - Received-lines according to RFC-822

The characters allowed differ among the parts of the headers.

Because of the possible introduction of multi-octet character sets,
RFC-822 should be interpreted like this:

      Any reference to a specific character in RFC-822 is
      interpreted as a reference to the octet that represents
      this character in US-ASCII.

Special lexical parts from RFC-822:


Comment:
'(', ')' and '\' can occur in a comment only in quoted pairs.
That means that the octets 050, 051 and 0134 must be quoted
with a 0134 ('\' in US-ASCII). This is relevant for the Text
subtypes ISO-2022, Quoted-Readable and ISO-10646-AUC, and possibly
for some future multi-octet standard.
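
A minimal sketch of this octet-wise quoting of a comment, in Python. The
function names are made up for the illustration; the only thing taken
from the text is that octets 050, 051 and 0134 are preceded by 0134.

```python
QUOTE = 0o134                             # '\' in US-ASCII
COMMENT_FORBIDDEN = {0o50, 0o51, 0o134}   # '(', ')' and '\' as octets

def quote_comment(raw: bytes) -> bytes:
    """Put a 0134 octet in front of every forbidden octet."""
    out = bytearray()
    for octet in raw:
        if octet in COMMENT_FORBIDDEN:
            out.append(QUOTE)
        out.append(octet)
    return bytes(out)

def unquote_comment(quoted: bytes) -> bytes:
    """Drop each quoting 0134 and keep the octet that follows it."""
    out = bytearray()
    i = 0
    while i < len(quoted):
        if quoted[i] == QUOTE and i + 1 < len(quoted):
            i += 1
        out.append(quoted[i])
        i += 1
    return bytes(out)

print(quote_comment(b"KTH (NADA)"))       # b'KTH \\(NADA\\)'
```

Because the functions work on octets and never on characters, they behave
the same whether a 050 octet is a real '(' or part of a multi-octet
sequence.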

Phrase and atom:
A phrase may consist of atoms and quoted strings. There is, though, a
prohibition against 13 special characters in atoms. But in all places
interesting for this proposal (for example in the From-line and the
Keyword-line) the atom can be replaced by a quoted string. In the latter
only the octets 0134 ('\') and 042 ('"') must be quoted.
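
The replacement of an atom by a quoted string can be sketched like this
in Python. The helper names is_atom and as_word are hypothetical, and
treating every octet above 126 as "not allowed in an atom" is an
assumption made for the example, not something the proposal states.

```python
SPECIALS = set(b'()<>@,;:\\".[]')   # the 13 RFC-822 specials, as octets

def is_atom(raw: bytes) -> bool:
    """True if raw has no specials, no space, no controls, no 8-bit octets."""
    return len(raw) > 0 and all(32 < o < 127 and o not in SPECIALS for o in raw)

def as_word(raw: bytes) -> bytes:
    """Return raw unchanged if it is an atom, otherwise as a quoted string."""
    if is_atom(raw):
        return raw
    out = bytearray(b'"')
    for octet in raw:
        if octet in (0o134, 0o42):   # '\' and '"' must be quoted
            out.append(0o134)
        out.append(octet)
    out.append(0o42)
    return bytes(out)

print(as_word(b"Patrik"))            # b'Patrik'
print(as_word(b"P. F"))              # b'"P. F"'
```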

Note that the need to introduce extra quoting because of the character
set only arises if we use Quoted-Readable, ISO-2022 or ISO-10646-AUC.

C) What parts of the headers do the new headers control?

Headers:
        Subject:
        Comments:
        Content-Description:
        Summary: (from RFC-1036)
        Organization: (from RFC-1036)

        ...and all user-defined fields in RFC-822.

Other parts:
        The Phrase, but not in the Received-line
        The Comment, but not in the Received-line

D) What is so special with the Received-line?

The Received lines are added by MTAs during the transport of the mail
from the sender to the receiver. If the sender uses Header-Transfer-Encoding
Quoted-Printable and some old MTA adds a Received: line that by chance
contains a character mnemonic preceded by the intro character, it will
be decoded, which was not what was intended.

That's why we don't change or encode the Received-lines.

The syntax of the Received-line is according to RFC-822.
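
To illustrate the accident described above, here is a small Python sketch
of a decoder that leaves the Received lines alone. It assumes the '=XX'
escape form, handles only the unstructured fields (phrases and comments in
other fields are left out), and all names and the sample header values are
invented for the example.

```python
import re

# Unstructured fields the new headers are meant to control (subset).
TEXT_FIELDS = {b"subject", b"comments", b"content-description",
               b"summary", b"organization"}

def decode_value(value: bytes) -> bytes:
    """Replace every =XX (upper-case hex) by the octet it stands for."""
    return re.sub(rb"=([0-9A-F]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), value)

def decode_headers(fields):
    """fields is a list of (name, value) byte pairs; Received is never decoded."""
    return [(name, decode_value(value) if name.lower() in TEXT_FIELDS else value)
            for name, value in fields]

fields = [
    (b"Received", b"from a.example by b.example; id=4A17"),
    (b"Subject", b"Hej F=E4ltstr=F6m"),
]
for name, value in decode_headers(fields):
    print(name, value)
```

Had the Received value been passed through decode_value as well, the
"=4A" in the made-up id would have turned into the letter 'J'.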

E) Remarks

At the beginning we had RFC-822. It defined the characters in the headers
to be 0-127. Some parts of the headers included special characters.

When we start to use more characters than the 0-127 we have to encode
them in some way. Why not do that in the same, or at least an equivalent,
way as for the Body of the mail? We therefore saw the possibility to
introduce the Header-Transfer-Encoding and the Header-Type.

One disadvantage is of course that we then introduce dependencies between
different headers. That in turn means that we have to parse the headers
twice, or at least sometimes twice if we start by guessing a mode
and only re-parse if we find out that our guess was wrong.
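
The "guess a mode, re-parse only if wrong" approach might look roughly
like this in Python. It is a sketch under several assumptions: the first
guess is 7-bit, the encoding uses the '=XX' form, only the Subject field
is decoded, and the function names are invented.

```python
import re

GUESS = "7-bit"   # mode assumed on the first pass (assumption)

def parse(lines, mode):
    """Split 'Name: value' lines; decode =XX in Subject in quoted-printable mode."""
    fields = []
    for line in lines:
        name, _, value = line.partition(b":")
        name, value = name.strip().lower(), value.strip()
        if mode == "quoted-printable" and name == b"subject":
            value = re.sub(rb"=([0-9A-F]{2})",
                           lambda m: bytes([int(m.group(1), 16)]), value)
        fields.append((name, value))
    return fields

def parse_with_guess(lines):
    """Return (fields, number_of_passes)."""
    fields = parse(lines, GUESS)
    declared = dict(fields).get(b"header-transfer-encoding", GUESS.encode("ascii"))
    mode = declared.decode("ascii").lower()
    if mode == GUESS:
        return fields, 1              # the guess was right, one pass is enough
    return parse(lines, mode), 2      # the guess was wrong, parse again

lines = [b"Header-Transfer-Encoding: Quoted-Printable",
         b"Subject: Hej F=E4ltstr=F6m"]
print(parse_with_guess(lines))
```

Mail without the new headers is parsed once, so the second pass is only
paid for by messages that actually use the new encoding.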

On the other hand the end user, the most important person in
this whole discussion, is now able to use his own characters in
the different parts of the mail that he probably looks upon as
really belonging to his message, i.e. the mail-body itself.
He can use his characters on the Subject-line for example.

During the further discussions we had this morning in Stockholm,
we found out that it is possible to identify the other textual parts
of the headers that don't really have to be parsable, i.e. where it
doesn't matter if they are encoded in some way.

We found that the Comment (used as separator in several fields) and
the Phrase (used in for example the From-field and Keywords-field)
included pure text except some special characters. I will discuss them
later.

The idea then popped up that the Header-Type and Header-Transfer-Encoding
should operate on just these fields: Subject, Comments, Content-Description,
Organization and Summary, on the various Comments included between '(' and ')',
and also on the Phrase in for example the From: field and the Keyword-header.

But then the special characters. How to deal with them?

First of all we saw a problem with multi-octet encodings. Some
of the octets might, taken alone, look like a special character in the
US-ASCII list of characters. To avoid that, the list of special characters
must be changed to a list of forbidden octets.
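
A small Python illustration of the multi-octet problem: it lists the
octets of an encoded string that coincide with RFC-822 specials. The
choice of ISO-2022-JP (Python codec name "iso2022_jp") as the sample
multi-octet encoding and the function name colliding_octets are
assumptions made for the example.

```python
SPECIALS = set(b'()<>@,;:\\".[]')   # the 13 RFC-822 specials, as octets

def colliding_octets(text: str, codec: str):
    """Return (position, character) for each octet that looks like a special."""
    encoded = text.encode(codec)
    return [(i, chr(o)) for i, o in enumerate(encoded) if o in SPECIALS]

# One hiragana letter in a 7-bit multi-octet encoding.
print(colliding_octets("\u3042", "iso2022_jp"))

# A single-octet 8-bit character set produces no collisions here.
print(colliding_octets("F\u00e4ltstr\u00f6m", "iso-8859-1"))
```

In ISO-2022-JP the escape sequence that switches back to ASCII is
ESC ( B, so a 050 octet appears even though the text itself contains
no parenthesis.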

Secondly we have a problem with encodings that produce a lot of special
characters that must be quoted in a Quoted-Pair (the Comment) or
in a Quoted-String (the Phrase). The first problem arises with
Quoted-Printable. It uses in its present form ':', which is a special
character in the Phrase. Therefore we want to change that ':' into
a non-special character, for example '='. Other non-special characters
include '&', but that one is used in Quoted-Readable as a
quoting character.

If we, in a comment, use quoted-pairs on octet-basis we do not
have any problem with them.

In a Phrase, we instead use a quoted-string.





   Patrik F{ltstr|m, Peter Svanberg, Olle J{rnefors, Jan Michael Rynning

==============================================================================
Patrik F{ltstr|m                        Internet: paf(_at_)nada(_dot_)kth(_dot_)se
Department of Numerical Analysis        BITNET:   paf(_at_)sekth
  and Computing Science                 Phone: +46-8-7906274
Royal Institute of Technology           Fax:   +46-8-7900930
S-100 44 Stockholm
Sweden