Yet another proposal for non-ASCII chars in headers

What follows is another proposal for the representation of non-ASCII
characters in message headers.   Basically, it uses a set of
delimiters to denote "words" that are encoded using "safe", printable 
characters; the "words" also include abbreviated character set 
information.

The result, I think, is sometimes nearly satisfactory and sometimes 
downright ugly when displayed on existing software.  Nevertheless, 
it attempts to avoid the operational problems of some of the other 
schemes proposed:  It should not cause existing mail software to 
break or be confused, and it should preserve the text information even 
though the message headers have been munged in various ways by 
internetwork mail gateways.

I encourage comments both on how this proposal can be improved and
on how it compares with the other proposals on the table.

Keith



Network Working Group                                   Keith Moore
DRAFT INTERNET DRAFT                        University of Tennessee
                                                    21 October 1991


   Representation of Non-ASCII Text in Internet Message Headers


I.  Status of this memo

This memo describes an extension to the message format defined in RFC
XXXX, to allow the representation of character sets other than ASCII
in RFC 822 message headers.  The extensions described were designed to
be highly compatible with existing Internet mail handling software,
and to be easily implemented in mail readers that support RFC XXXX.

This memo is being submitted to the IETF Message Format Extensions
Working Group for consideration as a standard for the Internet
community.   Distribution of this memo is unlimited.


II.  Introduction

RFC XXXX (1) describes a mechanism for denoting textual body parts
which are coded in various character sets, as well as methods for
encoding such body parts as sequences of printable ASCII characters.
This memo describes similar techniques to allow the encoding of
non-ASCII text in various portions of an RFC 822 (2) message header, in
a manner which is unlikely to confuse existing message handling
software.

Like the encoding techniques described in RFC XXXX, the techniques
outlined here were designed to allow the use of non-ASCII characters
in message headers in a way which is unlikely to be disturbed by the
quirks of existing Internet mail handling programs.  In particular,
some mail relaying programs are known to (a) delete some message
headers while retaining others, (b) rearrange the order of addresses
in To or Cc headers, (c) rearrange the order of message headers,
and/or (d) "wrap" message headers at different places than those in 
the original message.  In addition, some mail reading programs are 
known to have difficulty correctly parsing message headers which, 
while legal according to RFC 822, make use of backslash-quoting to 
"hide" special characters such as "<", ",", or ":", or which exploit 
other infrequently-used features of that specification.

While it is unfortunate that these programs do not correctly interpret
RFC 822 headers, to "break" these programs would cause severe
operational problems for the Internet mail system.  The extensions
described in this memo therefore do not rely on little-used features
of RFC 822.  Instead, certain sequences of "ordinary" printable ASCII
characters (which are assumed to be unlikely to otherwise appear in 
message headers) are reserved for use as encoded data.  The characters 
used in these encodings are restricted to those which do not have special 
meanings in the context in which the encoded text appears.


III. Encoding

An "encoded-word" is defined by the following grammar, using the
augmented BNF notation of RFC 822:


encoded-word = "=" "?" charset "?" encoding "?" encoded-text "?" "="

charset = charset-name / charset-number

charset-name = etoken 

charset-number = 1*digit

encoding = etoken   ; Either "B" or "Q"

etoken = 1*<Any CHAR except SPACE, CTLs, and especials>

especials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "\" / <"> /
            "." / "[" / "]" / "?" / "/" / "="

encoded-text = 1*<Any printable ASCII character other than "?" or SPACE>
               ; (but see "Use of encoded-words in message headers",
               ;  below)


An encoded-word may not be more than 75 characters long (including
charset, encoding, encoded-text, and delimiters).  If it is desirable
to encode more text than will fit in an encoded-word of 75 characters,
multiple encoded-words may be used.  Message header lines that contain
one or more encoded-words should be no more than 76 characters long.
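
As an illustration only, the grammar and the 75-character limit above
might be checked along the following lines.  This is a sketch, not part
of the proposal: the names ESPECIALS, ENCODED_WORD and
parse_encoded_word are hypothetical, and the inclusion of ";" among the
especials is an assumption taken from the "specials" of RFC 822.

```python
import re

# especials from the grammar; ";" is assumed here (it is an RFC 822 special).
ESPECIALS = '()<>@,;:\\"/[].?='

# etoken: printable ASCII (33..126) with the especials removed.
ETOKEN_CHARS = "".join(chr(c) for c in range(33, 127) if chr(c) not in ESPECIALS)
ETOKEN = "[" + re.escape(ETOKEN_CHARS) + "]+"

# encoded-text: printable ASCII other than "?" or SPACE ("!".."~" minus "?").
ENCODED_TEXT = "[!->@-~]+"

ENCODED_WORD = re.compile(
    r"=\?(?P<charset>" + ETOKEN + r")"
    r"\?(?P<encoding>" + ETOKEN + r")"
    r"\?(?P<text>" + ENCODED_TEXT + r")\?="
)


def parse_encoded_word(word):
    """Return (charset, encoding, encoded_text), or None if not an encoded-word."""
    if len(word) > 75:  # length limit includes the delimiters
        return None
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        return None
    return match.group("charset"), match.group("encoding"), match.group("text")
```

A charset-number consists only of digits, which are also legal etoken
characters, so the sketch does not need a separate rule for it.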


IV.  Character sets

The set of legal values for "charset-name" is the same as that which
is legal for an RFC XXXX "text-subtype".  For the sake of brevity, a
charset-number may be used instead of a charset-name when a number has
been assigned for that character set.  The initial assignment of
values for "charset-number" is as follows:

    1   IS 8859/1   (Latin-1)
    2   IS 8859/2   (Latin-2)
    3   IS 8859/3   (Latin-3)
    4   IS 8859/4   (Latin-4)
    5   IS 8859/5   (Cyrillic)
    6   IS 8859/6   (Arabic)
    7   IS 8859/7   (Greek)
    8   IS 8859/8   (Hebrew)
    9   IS 8859/9   (Latin-5)

Charset-number "0" is reserved for future use with some variant of ISO
standard 10646.  Other numbers may be assigned at a later date by the
Internet Assigned Numbers Authority (IANA).
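
For illustration, a mail reader might keep the initial assignments in a
small table.  The sketch below assumes Python's codec names
("iso8859_1" through "iso8859_9") as stand-ins for the character sets
listed above; the names CHARSET_NUMBER_CODECS and
codec_for_charset_number are hypothetical.  Charset-number "0" and
unassigned numbers yield no codec.

```python
# Initial charset-number assignments, mapped to Python codec names
# (the codec names are an assumption of this sketch, not part of the memo).
CHARSET_NUMBER_CODECS = {
    1: "iso8859_1",  # Latin-1
    2: "iso8859_2",  # Latin-2
    3: "iso8859_3",  # Latin-3
    4: "iso8859_4",  # Latin-4
    5: "iso8859_5",  # Cyrillic
    6: "iso8859_6",  # Arabic
    7: "iso8859_7",  # Greek
    8: "iso8859_8",  # Hebrew
    9: "iso8859_9",  # Latin-5
}


def codec_for_charset_number(charset):
    """Return a codec name for a charset-number string, or None if unknown."""
    if not (charset.isascii() and charset.isdigit()):
        return None  # a charset-name, not a charset-number
    return CHARSET_NUMBER_CODECS.get(int(charset))
```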

Initially, the legal values for "encoding" are "Q" and "B".  These
encodings are described below.


V.  The "B" encoding

The "B" encoding is identical to the "BASE64" encoding defined by RFC
XXXX, except that there is no way to represent an end-of-line within a
"B"-encoded encoded-word.  The comma (",") is therefore not used in
the "B" encoding.
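
A minimal sketch of the "B" encoding follows.  It assumes that the
BASE64 alphabet of RFC XXXX matches the one implemented by Python's
standard base64 module, which may not hold for every draft of that
document; the function names b_encode and b_decode are hypothetical.

```python
import base64


def b_encode(data: bytes) -> str:
    """Encode raw bytes as one unbroken run of base64 characters."""
    # b64encode never inserts line breaks, matching the rule that an
    # end-of-line cannot be represented inside a "B" encoded-word.
    return base64.b64encode(data).decode("ascii")


def b_decode(text: str) -> bytes:
    """Decode "B" encoded-text, rejecting characters outside the alphabet."""
    return base64.b64decode(text, validate=True)
```

Under that assumption the output uses only letters, digits, "+", "/"
and "=", none of which is excluded from encoded-text.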


VI.  The "Q" encoding

The "Q" encoding is similar to the "Quoted-Printable" encoding defined
in RFC XXXX.  It is designed to allow text containing mostly ASCII
characters to be decipherable on an ASCII terminal without decoding.

1.  Any 8-bit value may be represented by a "=" followed by two 
    hexadecimal digits.  For example, if the character set in use were
    ISO 8859/1, the "=" character would thus be encoded as "=3D", and
    a SPACE by "=20".

2.  The 8-bit hexadecimal value 20 (e.g. IS 8859/1 SPACE) may be
    represented as "_" (underscore, ASCII 95.).  (This character may
    not pass through some internetwork mail gateways, but its use will
    greatly enhance readability of "Q" encoded data with mail readers
    that do not support this encoding.)  Note that the "_" always
    represents hexadecimal 20, even if the SPACE character occupies
    a different code position in the character set in use.

3.  8-bit values which correspond to printable ASCII characters
    other than "=", "?", "_" (underscore), and SPACE may be
    represented as those characters.  (But see "Use of encoded-words
    in message headers", below).
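
The three rules above can be sketched as follows.  This is an
illustration under the rules as stated, operating on raw 8-bit values;
the names q_encode and q_decode are hypothetical, and the additional
restrictions for "word" and "qtext" contexts are not applied here.

```python
def q_encode(data: bytes) -> str:
    """Apply rules 1-3 of the "Q" encoding to a sequence of 8-bit values."""
    out = []
    for byte in data:
        if byte == 0x20:
            out.append("_")                  # rule 2: hex 20 becomes "_"
        elif 33 <= byte <= 126 and chr(byte) not in "=?_":
            out.append(chr(byte))            # rule 3: printable ASCII as itself
        else:
            out.append("=%02X" % byte)       # rule 1: "=" plus two hex digits
    return "".join(out)


def q_decode(text: str) -> bytes:
    """Reverse q_encode; raises ValueError on a malformed "=" escape."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "=":
            hex_pair = text[i + 1:i + 3]
            if len(hex_pair) != 2:
                raise ValueError("truncated escape in Q-encoded text")
            out.append(int(hex_pair, 16))
            i += 3
        elif ch == "_":
            out.append(0x20)                 # "_" always means hex 20
            i += 1
        else:
            out.append(ord(ch))
            i += 1
    return bytes(out)
```

For example, q_encode applied to the ISO 8859/1 bytes of "Keith Moore"
gives "Keith_Moore", which remains readable without decoding.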


VII.  Use of encoded-words in message headers

A sequence of one or more encoded-words is used to represent non-ASCII
textual data within a message header.  An encoded-word must be separated 
from any adjacent encoded-words, "word"s, "text", "ctext", or "qtext" 
by a linear white-space character or an end-of-line.

When multiple encoded-words appear in the same header, separated only 
by ends-of-lines or linear white space, the ends-of-lines or white space 
are not displayed.  This allows long pieces of textual data to be 
represented by the concatenation of two or more encoded-words without 
the introduction of extra spaces or line breaks in the decoded output.
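
One way a mail reader might apply this rule is sketched below.  The
recogniser is deliberately loose, the function name render_header is
hypothetical, and the caller is assumed to supply a decode_word
function that turns a single encoded-word into displayable text.
White space between an encoded-word and ordinary text is kept as it
appears; only white space between two encoded-words is dropped.

```python
import re

# Loose recogniser: "=?", three fields free of "?" and white space, "?=".
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[^?\s]+\?[^?\s]+\?=")


def render_header(value, decode_word):
    """Decode encoded-words in a header value, joining adjacent ones."""
    out = []
    pending_ws = ""        # white space seen since the last token
    prev_encoded = False   # was the last token an encoded-word?
    for piece in re.split(r"(\s+)", value):
        if piece == "":
            continue
        if piece.isspace():
            pending_ws = piece
            continue
        is_encoded = ENCODED_WORD.fullmatch(piece) is not None
        if not (is_encoded and prev_encoded):
            out.append(pending_ws)   # keep white space unless between encoded-words
        out.append(decode_word(piece) if is_encoded else piece)
        pending_ws = ""
        prev_encoded = is_encoded
    out.append(pending_ws)
    return "".join(out)
```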

An encoded-word may appear in any of the following places in an RFC
822 message header:

- Anywhere a "text" entity, as defined by RFC 822, is allowed in a 
  message header.  (e.g.  the Subject header.)

- Within a comment delimited by "(" and ")", i.e., wherever a
  "ctext" is allowed.

- As a replacement for the "word" entity in a "phrase", e.g. one that
  precedes an address in a From, To, or Cc header.  In this case the
  set of characters that may be used in the encoded-word is restricted
  to: <upper and lower case ASCII letters, decimal digits, "!", "*", 
  "+", "-", "/", "=", and "_" (underscore, ASCII 95.)>.

- Within a quoted string, i.e., anywhere a "qtext" entity is allowed.
  The same restrictions on characters apply here as for a "word".
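
As a sketch of the restriction for "word" and "qtext" contexts, an
encoder might escape every value whose character falls outside the
permitted set.  The names PHRASE_SAFE, usable_in_phrase and
q_encode_for_phrase are hypothetical; the set itself is the one listed
above.

```python
import string

# Characters permitted in an encoded-word that replaces a "word".
PHRASE_SAFE = set(string.ascii_letters + string.digits + "!*+-/=_")


def usable_in_phrase(encoded_text: str) -> bool:
    """True if the encoded-text uses only the restricted character set."""
    return all(ch in PHRASE_SAFE for ch in encoded_text)


def q_encode_for_phrase(data: bytes) -> str:
    """ "Q"-encode bytes so the result is legal in place of a "word"."""
    out = []
    for byte in data:
        ch = chr(byte)
        if byte == 0x20:
            out.append("_")             # hex 20 is written as underscore
        elif ch in PHRASE_SAFE and ch not in "=_":
            out.append(ch)              # safe character represents itself
        else:
            out.append("=%02X" % byte)  # everything else is hex-escaped
    return "".join(out)
```

Text in the standard base64 alphabet (letters, digits, "+", "/" and
"=") already lies within the restricted set, so under that assumption
"B"-encoded text needs no extra treatment in these contexts.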

Whenever encoded-words appear in a header being displayed, an
enlightened mail reader will decode the text and render it appropriately.

Only textual data (printable and white space characters) should be
encoded using this scheme.  However, since these encoding schemes
allow the encoding of arbitrary 8-bit values, mail readers that
implement this decoding should also ensure that display of the decoded
data on the recipient's terminal will not cause unwanted side-effects.
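
One possible precaution, sketched here as an assumption rather than a
requirement of this memo, is to replace any decoded character that is
neither printable nor a tab before it reaches the terminal.  The
function name safe_for_display and the choice of "?" as a replacement
are hypothetical.

```python
def safe_for_display(text: str, replacement: str = "?") -> str:
    """Replace control characters (e.g. ESC) in decoded header text."""
    return "".join(
        ch if (ch.isprintable() or ch == "\t") else replacement
        for ch in text
    )
```

With this filter, a decoded escape sequence such as ESC "[2J" is shown
as "?[2J" instead of being interpreted by the terminal.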

Use of these methods to encode non-textual data (e.g. pictures or
sounds) is not defined by this memo.

Use of encoded-words to represent strings of US-ASCII characters is
valid (since ASCII is a subset of IS 8859/1), but discouraged.


VIII.  Recognition of encoded-words in message headers

An encoded-word may be distinguished from an ordinary "word", "text",
"ctext", or "qtext" as follows:

1.  An encoded-word begins with "=?"

2.  An encoded-word ends with "?="

3.  An encoded-word contains exactly four "?" characters, including the
    beginning "=?" and ending "?=" delimiters.

If the "word", "text", "ctext", or "qtext" does not meet the above
tests, it should be displayed as it appears in the message
header.  If the mail reader does not support the character set used,
it may either display the encoded text (i.e. as it appears in the
header), or it may substitute an appropriate message indicating that
the decoded text could not be displayed.
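
The three tests and the fallback behaviour can be sketched as follows.
The names looks_like_encoded_word and display_token are hypothetical;
supported_charsets and decode are assumed to be supplied by the mail
reader, and this sketch takes the first of the two permitted fallbacks
(displaying the encoded text as it appears).

```python
def looks_like_encoded_word(token: str) -> bool:
    """Apply the three recognition tests to a single token."""
    return (
        token.startswith("=?")          # test 1
        and token.endswith("?=")        # test 2
        and token.count("?") == 4       # test 3
    )


def display_token(token, supported_charsets, decode):
    """Return the text to display for one token of a header."""
    if not looks_like_encoded_word(token):
        return token                    # ordinary text: show unchanged
    charset = token.split("?")[1]
    if charset not in supported_charsets:
        return token                    # unsupported charset: show as-is
    return decode(token)
```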


IX.  Example

From: =?0?Q?Keith_Moore?= <moore@cs.utk.edu>

(others to be supplied in a later draft)


X.  Security considerations

Security considerations are not discussed in this memo.


XI.  References


(1) Nathaniel Borenstein and Ned Freed, "Mechanisms for Specifying and
    Describing the Format of Internet Message Bodies" (RFC XXXX), Internet 
    Draft, October 1991.

(2) David H. Crocker, "Standard for the Format of ARPA Internet Text
    Messages", RFC 822, August 1982.


XII.  Author's address

Keith Moore
University of Tennessee
107 Ayres Hall
Knoxville TN 37996-1301
USA

Internet: moore@cs.utk.edu