What follows is another proposal for the representation of non-ASCII
characters in message headers. Basically, it uses a set of
delimiters to denote "words" that are encoded using "safe", printable
characters; the "words" also include abbreviated character set
information.
The result, I think, is sometimes nearly satisfactory and sometimes
downright ugly when displayed on existing software. Nevertheless,
it attempts to avoid the operational problems of some of the other
schemes proposed: it should not cause existing mail software to
break or be confused, and it should preserve the text information even
when the message headers have been munged in various ways by
internetwork mail gateways.
I encourage comments both on how this proposal can be improved and
on how it compares with the other proposals on the table.
Keith
Network Working Group                                        Keith Moore
DRAFT INTERNET DRAFT                             University of Tennessee
                                                         21 October 1991
Representation of Non-ASCII Text in Internet Message Headers
I. Status of this memo
This memo describes an extension to the message format defined in RFC
XXXX, to allow the representation of character sets other than ASCII
in RFC 822 message headers. The extensions described were designed to
be highly compatible with existing Internet mail handling software,
and to be easily implemented in mail readers that support RFC XXXX.
This memo is being submitted to the IETF Message Format Extensions
Working Group for consideration as a standard for the Internet
community. Distribution of this memo is unlimited.
II. Introduction
RFC XXXX (1) describes a mechanism for denoting textual body parts
which are coded in various character sets, as well as methods for
encoding such body parts as sequences of printable ASCII characters.
This memo describes similar techniques to allow the encoding of
non-ASCII text in various portions of an RFC 822 (2) message header, in
a manner which is unlikely to confuse existing message handling
software.
Like the encoding techniques described in RFC XXXX, the techniques
outlined here were designed to allow the use of non-ASCII characters
in message headers in a way which is unlikely to be disturbed by the
quirks of existing Internet mail handling programs. In particular,
some mail relaying programs are known to (a) delete some message
headers while retaining others, (b) rearrange the order of addresses
in To or Cc headers, (c) rearrange the order of message headers,
and/or (d) "wrap" message headers at different places than those in
the original message. In addition, some mail reading programs are
known to have difficulty correctly parsing message headers which,
while legal according to RFC 822, make use of backslash-quoting to
"hide" special characters such as "<", ",", or ":", or which exploit
other infrequently-used features of that specification.
While it is unfortunate that these programs do not correctly interpret
RFC 822 headers, to "break" these programs would cause severe
operational problems for the Internet mail system. The extensions
described in this memo therefore do not rely on little-used features
of RFC 822. Instead, certain sequences of "ordinary" printable ASCII
characters (which are assumed to be unlikely to otherwise appear in
message headers) are reserved for use as encoded data. The characters
used in these encodings are restricted to those which do not have special
meanings in the context in which the encoded text appears.
III. Encoding
An "encoded-word" is defined by the following grammar, using the
augmented BNF notation of RFC 822:
   encoded-word   = "=" "?" charset "?" encoding "?" encoded-text "?" "="

   charset        = charset-name / charset-number

   charset-name   = etoken

   charset-number = 1*digit

   encoding       = etoken       ; Either "B" or "Q"

   etoken         = 1*<Any CHAR except SPACE, CTLs, and especials>

   especials      = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" /
                    "\" / <"> / "." / "[" / "]" / "?" / "/" / "="

   encoded-text   = 1*<Any printable ASCII character other than "?"
                    or SPACE>
                    ; (but see "Use of encoded-words in message
                    ; headers", below)
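As an illustration only, the grammar can be approximated with a regular expression. The sketch below is in Python; the names ESPECIALS, ENCODED_WORD, and parse_encoded_word are hypothetical and are not defined by this memo. It assumes that ";" belongs in the especials list, as it does in the RFC 822 "specials".

```python
import re

# especials from the grammar; ";" is assumed to belong in the list,
# as it does in the RFC 822 "specials".
ESPECIALS = '()<>@,;:\\"/[]?.='

# etoken: printable ASCII (no SPACE, no CTLs) minus the especials.
# Digits are included, so a charset-number also matches as an etoken.
_ETOKEN_CHARS = "".join(
    chr(c) for c in range(0x21, 0x7F) if chr(c) not in ESPECIALS
)
_ETOKEN = "[" + re.escape(_ETOKEN_CHARS) + "]+"

# encoded-text: printable ASCII other than "?" (0x3F) and SPACE.
ENCODED_WORD = re.compile(
    r"=\?(?P<charset>" + _ETOKEN + r")"
    + r"\?(?P<encoding>" + _ETOKEN + r")"
    + r"\?(?P<text>[\x21-\x3E\x40-\x7E]+)\?="
)


def parse_encoded_word(word):
    """Return (charset, encoding, encoded_text), or None if no match."""
    m = ENCODED_WORD.fullmatch(word)
    if m is None:
        return None
    return m.group("charset"), m.group("encoding"), m.group("text")
```

The pattern checks syntax only; it does not verify that the encoding is "B" or "Q", nor that the charset is one the reader supports.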
An encoded-word may not be more than 75 characters long (including
charset, encoding, encoded-text, and delimiters). If it is desirable
to encode more text than will fit in an encoded-word of 75 characters,
multiple encoded-words may be used. Message header lines that contain
one or more encoded-words should be no more than 76 characters long.
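One way a sending program might honor the 75-character limit is to split the text before encoding it. The following Python sketch uses the "B" encoding and assumes that it matches the base64 alphabet provided by Python's standard library; the function name split_into_encoded_words is hypothetical.

```python
import base64

MAX_ENCODED_WORD = 75  # limit stated above, delimiters included


def split_into_encoded_words(raw, charset="1"):
    """Split bytes into "B" encoded-words of at most 75 characters each.

    Splitting is done on byte boundaries, which is only safe for
    single-byte character sets such as the IS 8859 family.
    """
    overhead = len("=?" + charset + "?B?" + "?=")
    # Base64 turns each group of 3 bytes into 4 characters.
    max_bytes = (MAX_ENCODED_WORD - overhead) // 4 * 3
    if max_bytes < 1:
        raise ValueError("charset name leaves no room for encoded text")
    words = []
    for i in range(0, len(raw), max_bytes):
        chunk = base64.b64encode(raw[i:i + max_bytes]).decode("ascii")
        words.append("=?" + charset + "?B?" + chunk + "?=")
    return words
```

The caller would still need to place the resulting words on header lines of no more than 76 characters, separated by linear white space.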
IV. Character sets
The set of legal values for "charset-name" is the same as that which
is legal for an RFC XXXX "text-subtype". For the sake of brevity, a
charset-number may be used instead of a charset-name when a number has
been assigned for that character set. The initial assignment of
values for "charset-number" is as follows:
1 IS 8859/1 (Latin-1)
2 IS 8859/2 (Latin-2)
3 IS 8859/3 (Latin-3)
4 IS 8859/4 (Latin-4)
5 IS 8859/5 (Cyrillic)
6 IS 8859/6 (Arabic)
7 IS 8859/7 (Greek)
8 IS 8859/8 (Hebrew)
9 IS 8859/9 (Latin-5)
Charset-number "0" is reserved for future use with some variant of ISO
standard 10646. Other numbers may be assigned at a later date by the
Internet Assigned Numbers Authority (IANA).
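A mail reader written in Python might resolve a charset-number along the following lines. This is a sketch under assumptions: the codec names are those of Python's standard library, not names defined by this memo, and the helper codec_for_charset is hypothetical.

```python
# Initial charset-number assignments, mapped to Python codec names.
# Numbers 1 through 9 correspond to IS 8859/1 through IS 8859/9.
CHARSET_NUMBERS = {str(n): "iso8859_%d" % n for n in range(1, 10)}


def codec_for_charset(charset):
    """Map a charset-number or charset-name to a Python codec name."""
    if charset.isdigit():
        # "0" is reserved and other numbers are not yet assigned.
        if charset not in CHARSET_NUMBERS:
            raise LookupError("unassigned charset-number: " + charset)
        return CHARSET_NUMBERS[charset]
    # Assumption: a charset-name is handed to the codec machinery as is.
    return charset
```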
Initially, the legal values for "encoding" are "Q" and "B". These
encodings are described below.
V. The "B" encoding
The "B" encoding is identical to the "BASE64" encoding defined by RFC
XXXX, except that there is no way to represent an end-of-line within a
"B"-encoded encoded-word. The comma (",") is therefore not used in
the "B" encoding.
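A minimal Python sketch of the "B" encoding follows. It assumes that the BASE64 alphabet of RFC XXXX is the one implemented by Python's base64 module; the names encode_b and decode_b are hypothetical.

```python
import base64


def encode_b(raw):
    """Encode bytes as "B" encoded-text (one unbroken base64 string)."""
    # b64encode never inserts line breaks, which suits an encoded-word.
    return base64.b64encode(raw).decode("ascii")


def decode_b(encoded_text):
    """Decode "B" encoded-text, rejecting characters outside the alphabet."""
    return base64.b64decode(encoded_text, validate=True)
```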
VI. The "Q" encoding
The "Q" encoding is similar to the "Quoted-Printable" encoding defined
in RFC XXXX. It is designed to allow text containing mostly ASCII
characters to be decipherable on an ASCII terminal without decoding.
1. Any 8-bit value may be represented by a "=" followed by two
hexadecimal digits. For example, if the character set in use were
ISO 8859/1, the "=" character would be encoded as "=3D", and
a SPACE as "=20".
2. The 8-bit hexadecimal value 20 (e.g. IS 8859/1 SPACE) may be
represented as "_" (underscore, ASCII 95.). (This character may
not pass through some internetwork mail gateways, but its use will
greatly enhance readability of "Q" encoded data with mail readers
that do not support this encoding.) Note that the "_" always
represents hexadecimal 20, even if the SPACE character occupies
a different code position in the character set in use.
3. 8-bit values which correspond to printable ASCII characters
other than "=", "?", "_" (underscore), and SPACE may be
represented as those characters. (But see "Use of encoded-words
in message headers", below).
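The three rules above can be sketched in Python as follows. The names encode_q and decode_q are hypothetical, and the sketch ignores the additional character restrictions that apply in some header contexts.

```python
HEX_DIGITS = "0123456789ABCDEFabcdef"


def encode_q(raw):
    """Encode bytes using the three "Q" rules."""
    out = []
    for byte in raw:
        ch = chr(byte)
        if byte == 0x20:
            out.append("_")               # rule 2: hexadecimal 20
        elif 0x21 <= byte <= 0x7E and ch not in "=?_":
            out.append(ch)                # rule 3: printable ASCII
        else:
            out.append("=%02X" % byte)    # rule 1: "=" plus two hex digits
    return "".join(out)


def decode_q(encoded_text):
    """Decode "Q" encoded-text back to bytes."""
    raw = bytearray()
    i = 0
    while i < len(encoded_text):
        ch = encoded_text[i]
        if ch == "=":
            digits = encoded_text[i + 1:i + 3]
            if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError("malformed hexadecimal escape")
            raw.append(int(digits, 16))
            i += 3
        elif ch == "_":
            raw.append(0x20)
            i += 1
        else:
            raw.append(ord(ch))
            i += 1
    return bytes(raw)
```

The encoder always prefers "_" for hexadecimal 20; a sender worried about gateways that do not pass "_" could emit "=20" instead, and the decoder accepts both.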
VII. Use of encoded-words in message headers
A sequence of one or more encoded-words is used to represent non-ASCII
textual data within a message header. An encoded-word must be separated
from any adjacent encoded-words, "word"s, "text", "ctext", or "qtext"
by a linear white-space character or an end-of-line.
When multiple encoded-words appear in the same header, separated only
by ends-of-lines or linear white space, the ends-of-lines or white space
are not displayed. This allows long pieces of textual data to be
represented by the concatenation of two or more encoded-words without
the introduction of extra spaces or line breaks in the decoded output.
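The rule for adjacent encoded-words might be implemented roughly as shown in this Python sketch. The function render and its decode_word parameter are hypothetical; decode_word stands for whatever routine the mail reader uses to decode a single encoded-word, and the pattern used to spot encoded-words is deliberately loose.

```python
import re

# Loose pattern: "=?" charset "?" encoding "?" encoded-text "?=".
_EW = re.compile(r"=\?[^?\s]+\?[^?\s]+\?[^?\s]+\?=")


def render(header_value, decode_word):
    """Decode encoded-words, dropping white space that separates two of them."""
    # With a capturing group, re.split keeps the separators at odd indices,
    # so every separator has a token on each side.
    tokens = re.split(r"([ \t\r\n]+)", header_value)
    out = []
    for i, tok in enumerate(tokens):
        if i % 2 == 1:
            between_words = (_EW.fullmatch(tokens[i - 1])
                             and _EW.fullmatch(tokens[i + 1]))
            if not between_words:
                out.append(tok)   # ordinary white space is kept
        elif _EW.fullmatch(tok):
            out.append(decode_word(tok))
        else:
            out.append(tok)
    return "".join(out)
```

White space between an encoded-word and an ordinary word is preserved, so only the joins between consecutive encoded-words disappear from the displayed text.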
An encoded-word may appear in any of the following places in an RFC
822 message header:
- Anywhere a "text" entity, as defined by RFC 822, is allowed in a
message header (e.g. the Subject header).
- Within a comment delimited by "(" and ")", i.e., wherever a
"ctext" is allowed.
- As a replacement for the "word" entity in a "phrase", e.g. one that
precedes an address in a From, To, or Cc header. In this case the
set of characters that may be used in the encoded-word is restricted
to: <upper and lower case ASCII letters, decimal digits, "!", "*",
"+", "-", "/", "=", and "_" (underscore, ASCII 95.)>.
- Within a quoted string, i.e., anywhere a "qtext" entity is allowed.
The same restrictions on characters apply here as for a "word".
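For the restricted contexts (a "word" in a "phrase", or a quoted string), a sender could use a stricter variant of the "Q" encoder. The Python sketch below is illustrative only; PHRASE_SAFE, is_phrase_safe, and encode_q_for_phrase are hypothetical names.

```python
import string

# Characters permitted in an encoded-word that replaces a "word".
PHRASE_SAFE = set(string.ascii_letters + string.digits + "!*+-/=_")


def is_phrase_safe(encoded_text):
    """True if every character is in the restricted set."""
    return all(ch in PHRASE_SAFE for ch in encoded_text)


def encode_q_for_phrase(raw):
    """"Q"-encode bytes using only the restricted character set."""
    # "=" and "_" carry special meanings in "Q", so they are never literal.
    literal = PHRASE_SAFE - {"=", "_"}
    out = []
    for byte in raw:
        ch = chr(byte)
        if byte == 0x20:
            out.append("_")
        elif ch in literal:
            out.append(ch)
        else:
            out.append("=%02X" % byte)
    return "".join(out)
```

If the "B" encoding uses the usual base64 alphabet (letters, digits, "+", "/", and "="), its output already falls within the restricted set.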
Whenever such words appear in a header being displayed, an enlightened
mail reader will decode the text and render it appropriately.
Only textual data (printable and white space characters) should be
encoded using this scheme. However, since these encoding schemes
allow the encoding of arbitrary 8-bit values, mail readers that
implement this decoding should also ensure that display of the decoded
data on the recipient's terminal will not cause unwanted side-effects.
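One possible precaution, sketched in Python, is to replace control characters in the decoded text before it reaches the terminal. The name safe_for_display and the choice of "?" as the replacement are assumptions made for illustration.

```python
import unicodedata


def safe_for_display(text, replacement="?"):
    """Replace control characters (other than TAB) in decoded text."""
    out = []
    for ch in text:
        # Category "Cc" covers the C0 controls, DEL, and the C1 controls.
        if unicodedata.category(ch) == "Cc" and ch != "\t":
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out)
```

This keeps an escape sequence such as ESC "[2J" from being interpreted by the terminal, while leaving ordinary printable text unchanged.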
Use of these methods to encode non-textual data (e.g. pictures or
sounds) is not defined by this memo.
Use of encoded-words to represent strings of US-ASCII characters is
valid (since ASCII is a subset of IS 8859/1), but discouraged.
VIII. Recognition of encoded-words in message headers
An encoded-word may be distinguished from an ordinary "word", "text",
"ctext", or "qtext" as follows:
1. An encoded-word begins with "=?"
2. An encoded-word ends with "?="
3. An encoded-word contains exactly four "?" characters, including the
beginning "=?" and ending "?=" delimiters.
If the "word", "text", "ctext", or "qtext" does not meet the above
tests, it should be displayed as it appears in the message
header. If the mail reader does not support the character set used,
it may either display the encoded text (i.e. as it appears in the
header), or it may substitute an appropriate message indicating that
the decoded text could not be displayed.
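The three tests, together with the fallback behavior, might look like this in Python. The names looks_like_encoded_word and display_token are hypothetical, as are the decode and supported parameters, which stand for the reader's own decoding routine and its set of supported charsets.

```python
def looks_like_encoded_word(token):
    """Apply the three recognition tests to a single token."""
    return (token.startswith("=?")
            and token.endswith("?=")
            and token.count("?") == 4)


def display_token(token, decode, supported):
    """Return the text to display for one token of a header."""
    if not looks_like_encoded_word(token):
        return token                  # shown as it appears in the header
    charset = token.split("?")[1]
    if charset not in supported:
        return token                  # or substitute a suitable notice
    return decode(token)
```

The tests do not check that the charset, encoding, and encoded-text fields are well formed, so a decoder should still be prepared to reject a token that passes them.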
IX. Example
From: =?1?Q?Keith_Moore?= <moore@cs.utk.edu>
(others to be supplied in a later draft)
X. Security considerations
Security considerations are not discussed in this memo.
XI. References
(1) Nathaniel Borenstein and Ned Freed, "Mechanisms for Specifying and
Describing the Format of Internet Message Bodies" (RFC XXXX), Internet
Draft, October 1991.
(2) David H. Crocker, "Standard for the Format of ARPA Internet Text
Messages", RFC 822, August 1982.
XII. Author's address
Keith Moore
University of Tennessee
107 Ayres Hall
Knoxville TN 37996-1301
USA
Internet: moore@cs.utk.edu