Headers: the universal character set option in detail

DRAFT INTERNET DRAFT

  The Meaning of Internet Message Headers

    Ned Freed, Innosoft
    Robert Smart, CSIRO

      October 1991


Abstract

This document suggests extensions to the RFC-822 message
representation to allow the meaning of certain display-only field body
components to be modified in a safe backward compatible way. This will
allow glyphs other than the glyphs in the US-ASCII character set to be
represented in those fields.


1. Introduction

There are a number of components of the RFC-822 headers which are
clearly marked in that standard as being intended for human viewing
and do not have any internal computer-processable meaning. It is
only those fields that are affected by this memo, and by the proposed
alternative solutions. The field components involved are the full
text field body of "Subject" and "Comments", and quoted phrases and
comments in the field body of headers which have lists of addresses,
such as "From", "To" and many more.

The need to allow more than US-ASCII in these display fields is a
strong requirement of the non-English speaking parts of the Internet.
This requirement clearly complements the capabilities for the message
body provided by rfc-xxxx.

The key requirement of this (as with alternative proposals) is that it
interact harmlessly with existing rfc-822 based mail software. The
only relevant action of existing mail software is to parse these field
components correctly so that they can be passed over (or extracted and
displayed) correctly. For this to continue to work it is only
necessary that the syntax allowed by this memo is a subset of the
syntax allowed by rfc-822, and it is in fact a proper subset. By this
I mean that the set of messages conforming to this proposed standard
is a proper subset of the set of messages complying with rfc-822
syntax.

This memo defines a new header. As with rfc-xxxx this memo takes a
certain set of rfc-822 messages which are legal but whose meaning is not 
defined by an Internet Standard because they contain undefined headers,
and gives a specific meaning to some while declaring others to be
illegal. As with rfc-xxxx this act of declaring some previously legal
messages illegal is the only theoretical backward compatibility problem
and it is not a practical backward compatibility problem because no
software exists which is generating those illegal messages and future
compliant software will obey this standard.

Appendix A of this document will argue that of the proposals that have
been made to cover this requirement there are 3 which will actually
work. Of these 3 this proposal has particular aesthetic and practical
advantages. This proposal can be called the universal character set
proposal. The other two are cross-reference and internally-specified
character set.


2. The Header-Charset header.

The new Header-Charset header defines a character set and encoding
that applies to all relevant headers in the message. What the relevant
headers are and how the application works is described in later
sections. The format of the field body is defined by

  Header-Charset := charset "/" encoding

The set of possible values for charset and encoding are as defined
in rfc-xxxx and by the IANA. The legal encodings are as allowed by
the Content-TransportEncoding header in RFC-XXXX.

The encoding may only be 8bit if the Content-TransportEncoding for
the message is 8bit or binary. The encoding may only be binary if
the Content-TransportEncoding is binary. The meaning of binary
encoding is not defined here (as it is not defined in rfc-xxxx) and
any subsequent rfc defining the meaning of binary for rfc-xxxx should
also define the meaning of binary for this header.

The result of the restrictions in the previous paragraph is that the
encoding does not affect the transportability of the message over
particular restricted transport mechanisms.
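
As a rough illustration only, the restriction above can be written
as a small check. The function name and the lower-case token
spellings are assumptions of this sketch, not part of the proposal.

```python
def header_encoding_allowed(header_encoding: str, message_encoding: str) -> bool:
    """Return True if the Header-Charset encoding is permitted for a
    message whose Content-TransportEncoding is message_encoding.

    Hypothetical helper: it encodes only the two restrictions stated in
    the text and treats every other encoding as always allowed.
    """
    h = header_encoding.strip().lower()
    m = message_encoding.strip().lower()
    if h == "binary":
        # binary headers require a binary message
        return m == "binary"
    if h == "8bit":
        # 8bit headers require an 8bit or binary message
        return m in ("8bit", "binary")
    return True
```

With this rule a message declared as 7bit can never carry 8bit
headers, so the header encoding cannot make the message less
transportable than its body already is.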

If this header is absent then it is equivalent to having:

  Header-Charset: us-ascii / 7bit
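
A minimal sketch of reading this field body follows. It assumes
that white space around the "/" is insignificant (as the examples
suggest) and that the tokens are case-insensitive; neither point is
stated in this memo, and the function name is hypothetical.

```python
from typing import Optional, Tuple

# Value implied when the header is absent, as stated in the text.
DEFAULT_HEADER_CHARSET = ("us-ascii", "7bit")


def parse_header_charset(field_body: Optional[str]) -> Tuple[str, str]:
    """Split a Header-Charset field body into (charset, encoding).

    Pass None when the message has no Header-Charset header.
    """
    if field_body is None:
        return DEFAULT_HEADER_CHARSET
    charset, sep, encoding = field_body.partition("/")
    # Assumption: tokens are compared case-insensitively.
    charset = charset.strip().lower()
    encoding = encoding.strip().lower()
    if not sep or not charset or not encoding:
        raise ValueError("expected: charset \"/\" encoding")
    return charset, encoding
```

For example, parse_header_charset("mnemonic / quoted-printable")
yields the pair ("mnemonic", "quoted-printable").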


3. Defined Conversions.

The fact that this mechanism defines only one character set for all the
headers is a potential problem. The problem is solved by the use of
a universal character set. The existence of such a universal character
set is a core assumption of this memo.

The changes defined in this memo are completely ignored by MTAs. The
only times that the relevant headers are added or examined or used is
under the clear control of the sender or recipient. So the
transformations defined here should not be controversial. Further this
memo will also recommend a preferred Header-Charset header for all
new (rfc-xxxx style) messages which will avoid all conversions.

Changing encoding is trivial and is not discussed here. The problem
comes from multiple character sets.

When creating a message the sender may wish to combine components
from various places. E.g. when replying to a message the sender may
take the subject field of the original message and prefix it with "Re",
and take the From field and make it a To field. He then may wish
to add his own From field which could have a quoted phrase in a 
different character set, and Cc to someone else whose address he
picks out of another message with a potentially 3rd character set.
The message may then get to a point where the recipient causes
the message to be forwarded and Resent headers get added.

The rule is that only the sequence of glyphs is significant. The
bit pattern or character set is not important and may be changed.
In particular the mail software generating a set of headers may
render them all in a universal character set to make them all
compatible. To be even more particular, this memo recommends the
use of the Mnemonic universal character set defined in RFC-MNEM
and RFC-CHAR. This has particular advantages when a message with
a Header-Charset header reaches a recipient which does not
recognise that header and displays the character set in its
encoded form. Another advantage of Mnemonic is that it is a 7-bit
universal character set.

While all these actions take place outside the arena of
standardization, we can define some obvious guidelines. Software
for generating headers from other messages (e.g. reply) should
look at the Header-Charset in the message, and should at least
know how to convert us-ascii to Mnemonic (which is trivial).
Forwarding software that wishes to add Resent headers should
at least be able to upgrade the message's Header-Charset from
US-ASCII to Mnemonic. It should have some plan for handling 
difficult messages: presumably encapsulation.
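
The guideline can be pictured as a small decision procedure. This
is only a sketch: the function name, the return values and the
"convertible" parameter are hypothetical, and the actual conversion
to Mnemonic is not shown.

```python
from typing import Iterable


def plan_header_charset(
    source_charsets: Iterable[str],
    convertible: frozenset = frozenset({"us-ascii", "mnemonic"}),
) -> str:
    """Pick a Header-Charset for headers assembled from several sources.

    source_charsets: the charset of each message a component came from.
    convertible: charsets this (hypothetical) software can render in
        Mnemonic; the minimum suggested in the text is us-ascii.
    Returns "us-ascii", "mnemonic", or "encapsulate".
    """
    names = {c.strip().lower() for c in source_charsets}
    if names <= {"us-ascii"}:
        # Nothing to upgrade; the default header value still applies.
        return "us-ascii"
    if names <= set(convertible):
        # Everything can be rendered in the universal character set.
        return "mnemonic"
    # A difficult message: fall back to encapsulation.
    return "encapsulate"
```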

[A possible reading of RFC-822 is that Resent headers can only be
added once. In that case we could solve the forwarding software's
problem by having Resent-Header-Charset to apply to Resent headers.
However this would not solve the general problem that headers are
created from a number of sources by the originator of the message.]


4. Application of the Header-Charset header

When a message has a Header-Charset header then the field components
detailed below are interpreted by removing the encoding and interpreting
the result as a sequence of glyphs specified in the nominated
character set. The result after removing the encoding must consist only
of printable glyphs and linear white space. Any other character that
controls motion, such as a line break, is illegal. For example, you
can't get a multi-line Subject by hiding line breaks inside a BASE64
encoding.
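
A sketch of this interpretation step follows, under several
assumptions: only the base64, 7bit and 8bit encodings are handled,
the charset name must be one Python's codecs know (Mnemonic is not),
and "printable glyph" is approximated by rejecting Unicode control,
format and line-separator characters. The function name is
hypothetical.

```python
import base64
import unicodedata


def decode_display_text(raw: bytes, charset: str, encoding: str) -> str:
    """Remove the encoding, interpret the octets in charset, and reject
    anything other than printable glyphs, space and tab."""
    enc = encoding.strip().lower()
    if enc == "base64":
        octets = base64.b64decode(raw, validate=True)
    elif enc in ("7bit", "8bit"):
        octets = raw
    else:
        raise ValueError("encoding not handled in this sketch: " + encoding)
    text = octets.decode(charset)
    for ch in text:
        if ch in (" ", "\t"):
            continue  # linear white space is allowed
        cat = unicodedata.category(ch)
        # Assumption: categories C* (control, format, ...) and the
        # line/paragraph separators stand in for "not a printable glyph".
        if cat.startswith("C") or cat in ("Zl", "Zp"):
            raise ValueError("illegal character U+%04X in display text" % ord(ch))
    return text
```

A BASE64 body that decodes to text containing CR or LF is rejected
by this check, which is the multi-line Subject case mentioned above.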

The character set and encoding defined by the Header-Charset header
applies to the following cases:

 (i)  The full field body of the "Subject" and "Comments" headers.

This causes no problems since the contents of these field bodies are
just a sequence of octets with no extra structure. The required 
sequence of glyphs is rendered in the chosen character set. The
result is then encoded with the chosen encoding and the result goes
without further change into the Subject or Comments field body.

 (ii) Quoted phrases and parenthesized comments in the following
      headers: Reply-To, From, Sender, Resent-Reply-To, Resent-Sender,
      Resent-From, To, Resent-To, Cc, Resent-Cc, Bcc, Resent-Bcc.

In this case the procedure is only slightly different. Render the
required sequence of glyphs in the character set; encode the
result in the specified encoding. Now before placing the result
in the quoted string or comment the RFC-822 quoting rules have to
be applied. This means converting backslashes and double-quotes to
quoted pairs ('\\' and '\"') and also converting parentheses to quoted
pairs in parenthesized comments ('\(' and '\)'). Double-quotes may
not strictly need quoting in comments, but I wouldn't risk it.

This process is easily invertible: de-quote, decode, interpret
as glyphs.
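
The quoting step and its inverse can be sketched as follows. The
functions operate on text that has already been rendered in the
character set and encoded; their names are hypothetical, and the
comment case quotes double-quotes as well, following the cautious
choice made above.

```python
def quote_for_quoted_string(s: str) -> str:
    """Apply quoted-pair quoting for use inside a quoted string."""
    # Backslash first, so the backslashes added for '"' are not doubled.
    return s.replace("\\", "\\\\").replace('"', '\\"')


def quote_for_comment(s: str) -> str:
    """Apply quoted-pair quoting for use inside a parenthesized comment."""
    out = quote_for_quoted_string(s)
    return out.replace("(", "\\(").replace(")", "\\)")


def dequote(s: str) -> str:
    """Invert either function above: drop each quoting backslash and
    keep the character that follows it."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            i += 1  # skip the backslash of a quoted pair
        out.append(s[i])
        i += 1
    return "".join(out)
```

Because every backslash in the quoted form introduces exactly one
quoted pair, dequote(quote_for_comment(s)) returns s unchanged, and
likewise for quoted strings.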

It has been observed that this could increase the use of rfc-822's
quoting conventions. This is an area where there have been in the past
a number of buggy implementations. Mistakes in parsing in this area
could lead to failure to detect the ends of quoted strings or
comments, and so loss of "real" information. The recommended
Header-Charset described below utilizes the encoding option to
actually reduce the use of the rfc-822 quoted-pair convention.


5. The Recommended Header-Charset

The recommended Header-Charset is:

    Header-Charset: mnemonic / quoted-printable

The normal use of quoted-printable encoding is to allow a character
set which uses octets greater than 127 to be transportable in a
7-bit environment. So it seems a little odd to encode Mnemonic which
is a pure 7-bit character set. Quoted-printable has little effect
on pure 7-bit input. It just requires that all ":"s be doubled. Its
advantage in this particular case is that it can be used within
quoted strings and comments to avoid RFC-822's quoted-pair rule.

So first the required sequence of glyphs is rendered in Mnemonic.  The
next step is conversion to quoted-printable. At this point it is
confusing to talk about the octets as if they were glyphs [e.g.  octet
3A(hex) might be a ":" glyph or part of some multi-octet glyph].  So I
will talk in hexadecimal. Occurrences of octet 3A must be doubled.  In
addition if the result is to go in a quoted string then any 5C octets
(backslash) and 22 octets (double-quote) should be converted to 3A 35 43 
(:5C) and 3A 32 32 (:22) respectively. In addition if the result is to 
go into a parenthesized comment then any 28 octet (left parenthesis) and 
29 octet (right parenthesis) should be converted to 3A 32 38 (:28) and 
3A 32 39 (:29) respectively.
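
The octet rules above can be sketched as follows. This is an
illustration under assumptions: it handles 7-bit input only, it
escapes exactly the octets listed in the text for each context, and
the names are hypothetical. Python's standard quopri module is not
used because it escapes with "=" rather than ":".

```python
COLON = 0x3A

# Octets escaped as ":XX" in each context, in addition to doubling 3A.
ESCAPED = {
    "text": frozenset(),                         # e.g. a Subject body
    "quoted-string": frozenset({0x5C, 0x22}),    # backslash, double-quote
    "comment": frozenset({0x28, 0x29}),          # left and right parenthesis
}


def colon_encode(octets: bytes, context: str = "text") -> bytes:
    """Encode 7-bit octets following the rules described in the text."""
    special = ESCAPED[context]
    out = bytearray()
    for b in octets:
        if b > 0x7F:
            raise ValueError("this sketch handles 7-bit input only")
        if b == COLON:
            out += b"::"              # 3A is doubled
        elif b in special:
            out += b":%02X" % b       # e.g. 5C becomes 3A 35 43
        else:
            out.append(b)
    return bytes(out)


def colon_decode(data: bytes) -> bytes:
    """Invert colon_encode: '::' gives 3A, ':XX' gives the octet XX."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] != COLON:
            out.append(data[i])
            i += 1
        elif data[i + 1:i + 2] == b":":
            out.append(COLON)
            i += 2
        else:
            pair = data[i + 1:i + 3]
            if len(pair) != 2:
                raise ValueError("truncated escape")
            out += bytes.fromhex(pair.decode("ascii"))
            i += 3
    return bytes(out)
```

For instance, the quoted-string content say "hi" becomes
say :22hi:22, which contains neither a backslash nor a double-quote
and so needs no quoted pairs.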

Because of these actions, when handling quoted strings or parenthesized
comments there will never be a need to invoke the rfc-822 quoted-pair
mechanism. The cost is not insignificant when interacting with
old software that does not understand the Header-Charset header. In
that case the doubling of all the ":"s in the Subject line particularly
may cause confusion, and some of the Mnemonic representations of glyphs
will look a lot less readable. If a survey were to find that the
fear of bugs in handling the rfc-822 parsing rules was unjustified 
then the preferred Header-Charset would be Mnemonic/7bit.

The converse is that mail-writers and UAs SHOULD NOT generate
messages with a character set other than Mnemonic or US-ASCII. If
there is a desire to break this for compatibility with local editors
and mail software then the local mail software SHOULD be structured
to funnel outgoing mail through a gateway which converts the
internally preferred character set to Mnemonic in the headers.

At some later stage Mnemonic might be replaced in this recommendation
by a new comprehensive standard universal character set (such as AUC).
At that time the recommendations for generating and forwarding 
software will specify that such software SHOULD be able to upgrade
both US-ASCII and Mnemonic to the new character set.


Appendix A. Header Options

Because headers get generated from a variety of sources there needs
to be some mechanism for handling incompatibilities between the
character sets of the various sources, or some way to allow multiple
character sets. The solution proposed in this document is to
assume that there is a universal character set and that things can
be combined by universalizing them.

Some of the other proposals do not meet the requirement of handling
headers put together from a variety of sources. This includes the
Real-* proposal and any scheme that allows a single character set for
all the headers to be specified but doesn't mention the key role of a
universal character set.  Two that do meet the requirement are as
follows:

1. Cross-Reference. 

This appeared in the first draft of rfc-xxxx.  The idea is that you
replace the thing you want in a separate character set with
$variable-name, and define the character set, encoding and value of
the variable in a separate Encoded-Variable header.

A semblance of residual readability is maintained with this scheme by
generating the name from the data by replacing the non-US-ASCII
characters with "X" and white space with "_". This works ok for some
quoted phrases in addresses but whether it would work so well for
Subject lines is not clear. 

There is no problem with multiple character sets when merging
information from different places into the headers. However one would
have to search for variable clashes and possibly rename [like
something out of lambda calculus].
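
A sketch of the name generation and clash handling follows. The
replacement rule comes from the description above; the numeric
suffix used to resolve clashes and both function names are
hypothetical, and characters that are not legal in an RFC-822 atom
are not dealt with here.

```python
from typing import Set


def variable_name(text: str) -> str:
    """Derive a readable variable name from display text: white space
    becomes "_" and non-US-ASCII characters become "X"."""
    out = []
    for ch in text:
        if ch.isspace():
            out.append("_")
        elif ord(ch) > 0x7F:
            out.append("X")
        else:
            out.append(ch)
    return "".join(out)


def unique_name(name: str, taken: Set[str]) -> str:
    """Rename on a clash by appending a numeric suffix (an assumed
    convention; the proposal does not say how to rename)."""
    candidate = name
    n = 2
    while candidate in taken:
        candidate = "%s_%d" % (name, n)
        n += 1
    return candidate
```

For example, variable_name("Keld J\u00f8rn Simonsen") gives
"Keld_JXrn_Simonsen".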

One advantage of this scheme is that it avoids quoted phrases
altogether by using the atom form of phrase. I don't think this
is of major importance - simple quoted phrases are not a problem.

Another feature of this scheme that could be an advantage or a
disadvantage (depending on the way you look at it) is that it could be
extended at a later date to put all sorts of funny things in quoted
strings, subjects, etc. It could be extended to allow a full
content-type specification for the variable allowing audio, pictures,
your CV in TeX, etc to all be put in the headers. Whether this would
mean anything sensible is doubtful. Even the idea of allowing
Text-plus/richtext as the Swedes would like is full of problems (silly
states) though it has its advantages as described in the next
sub-section.

This dropped out of rfc-xxxx presumably because people saw it as 
unnecessarily complex. The authors still think that is true. The issues 
are well enough understood now for the working group to evaluate whether 
the complexity is worth it.

2. Richtext (internal character set name). 

During a lively private discussion with Nathaniel Borenstein we
realised that the Swedish proposal included the germ of a 3rd option
that would work, and I think this Richtext option has also been
discovered by others.

The Swedish proposal wanted to allow more than just the character set
of header information. They wanted a Header-Content-Type which could
be Text or Text-plus. Because Text-plus/richtext allows switching
character sets it would be possible to easily combine information from
various places. However headers like this don't have much appeal:

  From: "<mnemonic> Keld J&o/rn Simonsen </mnemonic>" <keld@dkuug.dk>

If this was introduced by a header which defined the default richtext
character set, and people usually used mnemonic then this wouldn't be
too bad.

However the idea of extending header display to handle the funny
things that richtext allows seems undesirable (let alone any other
form of Text-plus).  Many text-plus/richtext features would be
meaningless.

If you allow Text-plus/richtext in the headers then allowing Text as
well is unnecessary [and brings back all the conversion issues]. The
obvious thing to do in this case is to introduce it with a single
header which could still be

        Header-Charset: charset / encoding

The presence of this header will mean that the header display fields
are in richtext with the specified charset being the default. The
actual charset could be switched in different components through the
richtext/sgml-like syntax.

3. Universal Character set.

To complete the set of 3 let me return to the proposal of this
document. At the core of that proposal is the recommendation that we
use Mnemonic as the universal character set in headers. This makes a
lot of sense because 

  a. We need to keep the headers 7-bit for backward compatibility.

  b. We need to interoperate on a reasonable basis with 7-bit 
     rfc-822-oriented mailers for quite a long time.

  c. It is the simplest solution to the problem.

In fact by encouraging support for Mnemonic we will help improve
contact between the old world and the new in areas beyond just the
headers.