Hi,
Greg V has encouraged me to post this, in spite of the mob
that will undoubtedly wish to subject all of us to more screaming ...
Before you hit that Reply-All key, please note:
-- this is cross posted to both lists. You will want to post a
well thought out comment to _one_ of them. Right? :-)
-- if you want to scream NOT ACCEPTABLE, or flame Prime Computer,
remember that we have heard it all before.
I will _not_ respond to postings that are devoid of technical
content, nor will I listen to anything but rational argument.
When someone posts nasty comments about me or Prime, I feel that
reflects more on the poster than on me, and does not, therefore
require a response.
The Internet is now an international playpen, and it is essential
that Internet mail support _all_ of its users' languages as expediently
as possible without compromising the final result. We can ignore
the kid in the corner, bawling his eyes out over the demise of
his US-centric network. :-)
Best Regards,
Robert Ullmann
Network Working Group D. Robinson, R. Ullmann
Internet Draft Prime Computer, Inc.
October 1991
International character support in SMTP
1. Status of this Memo
This memo describes an update to the SMTP protocol, and to the format
of an Internet mail message, to provide support for the character
sets used in the world's many languages. This memo documents existing
usage as well as specifying some additional interoperability
refinements. It updates RFCs 821, 822, and 1090 [4, 1, 6].
The Internet is no longer a creature of the United States, much less
of DARPA (the US Defense Advanced Research Projects Agency). It is now
an international network, and the ability to communicate in any of
the world languages on an equal footing is an imperative.
This draft attempts to track the development of ISO 10646 [3], a
moving target at this writing. The reference citation below is to the
previous 10646 draft, which failed in the balloting in June of 1991.
It is therefore expected that this memo will potentially change until
the publication of IS 10646. Some of the following text refers to
10646 in the present tense, as if it is IS now; it should be
understood in this context.
Distribution of this memo is unlimited.
2. Introduction
SMTP has been defined over the TCP as a 7 bit text protocol. While
there is some dispute as to whether this is actually a restriction,
the question is now mooted: Internet mailers are required to pass
the 8th bit when relaying Internet mail on the TCP. It is understood
that some time will pass before all Internet Mail Transfer Agents
(MTAs) can be expected to comply with the new requirement. This
explicitly modifies the provisions of RFC 821 [4].
In addition, a conceptual basis and a specific header field are
described for designating the character set(s) used, when that
character set is not ASCII-7 or AUC (see below); this provides
documentation of character sets being used presently in the Internet,
in the absence of the ISO universal set standard.
Robinson, Ullmann [Page 1]
Internet Draft International character support in SMTP October 1991
Note: This specification provides 8 bit text. It does NOT provide
"transparent" binary; in particular, the mail message is still
represented at the presentation layer (in the ISO model) as an
ordered set of text lines, of limited length.
This specification is written specifically for Internet mail transfer
agents, i.e. those operating as part of the Internet. It may not be
directly applicable to the mail transfer agents of other networks; in
the case of gateways to the Internet, it applies only to the
interface presented to the Internet, not necessarily to the other
network.
Likewise, a host receiving an SMTP mail message for final delivery
is subject to this specification only in that it should interpret the
incoming message as being in AUC, except where explicitly declared
otherwise as provided below.
3. Motivation
RFC 821 and 822 defined a mail message as a sequence of lines of
7-bit ASCII characters. At the time (1981), ASCII-7 constituted a
reasonable "lingua franca" for mail messages.
Much has changed. Today, SMTP mail is in use by a number of language
groups in character sets other than ASCII-7.
A number of European-language systems send mail in 7-bit ECMA-35
character sets [2] in which specific ASCII-7 characters have been
replaced with local non-ASCII-7 characters. ASCII itself has been
redefined as an 8 bit set (ASCII-8), with code points identical to
the 8-bit codeset ISO8859/1, which is itself one of a set of 8-bit
codesets, all in regular use in mail. Several non-European languages
use 7 and 8-bit multi-octet character sets.
In addition, the ISO is currently working on specification of a
32-bit universal character set, ISO10646 [3], and a related proposal
for an ASCII/Universal Character set (AUC) algorithm that would convert
the 32-bit codes into 1-5 octet sequences. The AUC code is
deliberately designed to be usable with existing software; in
particular, it is mailable through an 8-bit SMTP MTA.
Today, to support these newer character sets, most of the SMTP MTAs
being distributed by vendors no longer restrict mail to 7 bits.
This memo documents the existing usage and adds some refinements to
improve safe interoperability with older 7-bit SMTP MTAs.
This memo also proposes a new header field to designate the character
set used in headers during the transition, where not ASCII-7 or AUC.
It uses the Encoding header and structure [5] to identify the
character set(s) used in the content of the message when not AUC.
4. Terminology: Octets, Characters, and Character sets
Before proceeding, it is necessary to introduce some definitions.
For the purposes of this document:
- An octet is an 8-bit datum, which may contain values 0 through
255 decimal.
- A character is a conceptual entity, such as "A" or "o-diaeresis"
(o with 2 dots over it). The ISO working group has a simple
definition of "coded character": a coded character is something
that the standard assigns a code to.
- A (coded) character set is a transformation algorithm which maps
characters (as defined in UCS, see below) to octets or sequences
of octets.
Note that the same character coded in different character sets may
result in different octets. For example, the character "o-diaeresis",
code point 246 decimal in UCS: in the Swedish national variant of the
ECMA-35 character set it is the octet 124, but in IS 8859/1 the octet
246, and in AUC it is the 2 octets 160 246.
IS 10646 defines (will define) a 32 bit set, UCS-32, with characters
assigned to integer code points in the range 0. to 4294967295.
The first 128 code points are ASCII-7. The first 256 will almost
certainly be IS 8859/1 (ASCII-8). There is no reservation of octets
corresponding to C0 or C1, so the canonical form contains octets that
existing software treats as control characters. For example, LATIN
CAPITAL LETTER A is 65., or 00 00 00 41 in hex. A second problem is
that the canonical form is 4 times the raw size of most alphabetic
language files today (most of which now use a non-universal 7 or 8
bit code).
The "C0 committee" defined (working draft) A Transformation Method,
to address these problems. Codes are mapped through an algorithm to a
1-5 octet sequence.
For the purposes of this description, C0-G1 are defined a little
differently than usual, for simplicity:
C0 00 to 20
G0 21 to 7E
C1 7F to 9F
G1 A0 to FF
Ranges of the UCS code space are mapped to ranges of AUC as follows:
UCS-32 (decimal codes)          AUC (hexadecimal octets)
0. to 159.                      00 to 9F
160. to 255.                    A0 A0 to A0 FF
256. to 16405.                  A1 21 to F5 FF
16406. to 233005.               F6 21 21 to FB FF FF
233006. to 4294967295.          FC 21 21 21 21 to FF 59 3C C8 C3
The octets used in the multi-octet characters are in G0 and in G1.
The octets in C0 and C1 are mapped 1-1, and always represent
themselves.
There are no shifts or locking shifts, a major technical advantage
over the previous draft of 10646. Any C0 or C1 character (e.g.
including SPACE) thus provides a resynchronization point, if an error
occurs.
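
The range table above is enough to sketch an encoder. The following
Python is an illustration reconstructed from that table alone, not
from the working draft's algorithm text. It assumes that the octets
after the lead octet act as base-190 digits drawn from G0 and then G1
in ascending order, and that codes within each range are assigned in
numeric order. The names auc_encode, TRAIL_OCTETS and RANGES are
hypothetical.

```python
# Octets allowed after the lead octet: G0 (21-7E) then G1 (A0-FF), 190 in all.
# ASSUMPTION: they act as base-190 digits in ascending order.
TRAIL_OCTETS = list(range(0x21, 0x7F)) + list(range(0xA0, 0x100))

# (first code, last code, first lead octet, number of trailing octets),
# taken from the range table.
RANGES = [
    (256, 16405, 0xA1, 1),
    (16406, 233005, 0xF6, 2),
    (233006, 0xFFFFFFFF, 0xFC, 4),
]


def auc_encode(code):
    """Return the AUC octets for one UCS-32 code point (sketch only)."""
    if not 0 <= code <= 0xFFFFFFFF:
        raise ValueError("code point outside the 32-bit UCS range")
    if code < 160:
        return bytes([code])        # octets 00-9F represent themselves
    if code < 256:
        return bytes([0xA0, code])  # A0 prefix, then the octet itself
    for first, last, lead, ntrail in RANGES:
        if first <= code <= last:
            offset = code - first
            trail = []
            for _ in range(ntrail):
                offset, digit = divmod(offset, len(TRAIL_OCTETS))
                trail.append(TRAIL_OCTETS[digit])
            # offset is now the index of the lead octet within the range
            return bytes([lead + offset] + trail[::-1])
    raise AssertionError("unreachable: the ranges cover all 32-bit codes")


print(auc_encode(246).hex(" "))  # a0 f6
```

Under these assumptions the sketch reproduces every endpoint in the
table, including FF 59 3C C8 C3 for the code 4294967295.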
5. What is a Line?
SMTP messages are composed of lines. A line consists of 0-998 text
octets ending with a 13 (CR) 10 (LF) not included in the count. As
defined in RFC 821, this is a "minimum maximum": an MTA MUST accept
and relay lines of this length, and MAY allow lines of any length.
Text octets are defined to be 7 (BEL), 8 (BS), 9 (TAB), 11 (VT), 12
(FF), 27 (ESC), 32 (SPACE), 33-127 (G0), and 160-255 (G1). These
octets MAY be included in SMTP mail messages and MUST be relayed by
SMTP MTAs.
The following octets are not text octets: 10 (LF), 13 (CR), 138, 141.
These octets MUST NOT be included in text lines. 10 and 13 are used
in the line termination sequence. 138 and 141 are the octets 10 and
13 with the 8th bit set and will cause unexpected results with 7-bit
SMTP MTAs.
Some implementations (usually implicitly, as a consequence of
operating on file system semantics) convert CR and/or LF appearing by
themselves, i.e. "within" a line, to an end of line sequence. This
behavior is (now) valid.
In particular, SMTP MTAs SHOULD accept lines of message text and of
commands which are terminated only with 10 (LF). (This recognizes
the operational reality that a number of existing SMTP MTAs
misinterpreted the end-of-line specification.) Mailers MUST send
lines of message text and commands terminating with 13 (CR) 10 (LF).
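
A receiver following the rule above might split incoming data as in
this Python sketch, which accepts both CR LF and a bare LF as line
terminators and always emits CR LF. The function names are
hypothetical, and the treatment of a bare CR is deliberately left out.

```python
def split_lines_tolerant(data):
    """Split received octets into lines, accepting CR LF or bare LF."""
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # the data ended with a terminator; drop the empty tail
    # Remove the CR of a CR LF pair; a line ended by bare LF has none.
    return [line[:-1] if line.endswith(b"\r") else line for line in lines]


def join_lines_strict(lines):
    """Serialize lines for sending, always terminating with CR LF."""
    return b"".join(line + b"\r\n" for line in lines)


received = b"MAIL FROM:<a@b>\nRCPT TO:<c@d>\r\n"
print(join_lines_strict(split_lines_tolerant(received)))
```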
All other values are discouraged as text octets. Many are known to
cause difficulties with particular SMTP MTA implementations or with
particular operating systems. Nevertheless, MTAs SHOULD pass these
octets wherever possible.
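
The three classes of octet described in this section can be written
down directly. The Python below is a sketch of a checker that a
sending program might apply to one line before transmission; the
names classify_octet and check_line are hypothetical, and an MTA that
relays mail would pass "discouraged" octets rather than reject them.

```python
# Octet classes exactly as listed in this section.
TEXT_OCTETS = (frozenset([7, 8, 9, 11, 12, 27, 32])
               | frozenset(range(33, 128))
               | frozenset(range(160, 256)))
FORBIDDEN_OCTETS = frozenset([10, 13, 138, 141])
MAX_LINE_OCTETS = 998  # the "minimum maximum"; longer lines MAY be accepted


def classify_octet(octet):
    """Return 'text', 'forbidden' or 'discouraged' for one octet value."""
    if octet in FORBIDDEN_OCTETS:
        return "forbidden"
    if octet in TEXT_OCTETS:
        return "text"
    return "discouraged"


def check_line(line):
    """Report problems in one line, given WITHOUT its CR LF terminator."""
    problems = []
    if len(line) > MAX_LINE_OCTETS:
        problems.append("longer than %d octets" % MAX_LINE_OCTETS)
    for position, octet in enumerate(line):
        kind = classify_octet(octet)
        if kind != "text":
            problems.append("%s octet %d at offset %d" % (kind, octet, position))
    return problems


print(check_line(b"Gr\xfc\xdfe"))  # []
```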
6. Mail Message Format
6.1. Header Field Keywords
Header field keywords will remain in ASCII-7. This includes all
keywords, such as "with" in a Received header, or "delivery-report"
in a Message-Type header, as well as the header keywords themselves.
There is no expectation that this restriction will ever need to be
relaxed: user agents may recognize keywords and present them to meet
any arbitrary user requirements.
6.2. Header Field Bodies
It is perhaps obvious that unrestricted use of character sets other
than ASCII-7 and AUC in message headers will be a source of problems.
However, use of other sets (notably shift-JIS Kanji and ECMA) is
common current usage.
There are some headers in which it is obviously useful, and should be
permitted (e.g. Subject:) and others in which it may cause an MUA
(Mail User Agent) that does not understand it to mis-parse the
header. In particular, note that characters outside the ASCII-7 set
may be "stripped" by non-compliant MTAs to octets that correspond to
the values of <, >, (, ), and, more painfully, the values of " and \.
Field bodies of "unstructured" header elements (such as Subject:) MAY
contain the full range of text octets without any additional
transformations.
Field bodies of "structured" header elements (such as To:) which
apply "\" (92) escaping MUST be careful to apply the same escaping
not just to "meta" text octets like "<" (60) but also to text octets
in the 160-255 range which, when stripped to 7 bits, match the meta
octets, (e.g., 188).
Refer to RFC 822 for the precise definition of which syntax elements
require special characters to be quoted, and which prohibit it. It is
quite unambiguous on this subject.
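
The escaping rule for structured fields can be illustrated in Python.
This is a sketch only: the set of "meta" octets is taken from the
characters named in this section, whereas RFC 822 decides which of
them actually need quoting in a given syntax element. The names
strips_to_meta and escape_octets are hypothetical.

```python
# Meta characters named in this section: < > ( ) " and \
META_OCTETS = frozenset(b'<>()"\\')


def strips_to_meta(octet):
    """True if an 8-bit text octet becomes a meta octet when cut to 7 bits."""
    return octet >= 160 and (octet & 0x7F) in META_OCTETS


def escape_octets(text):
    """Prefix "\\" (92) to meta octets and to octets that strip to them."""
    out = bytearray()
    for octet in text:
        if octet in META_OCTETS or strips_to_meta(octet):
            out.append(92)
        out.append(octet)
    return bytes(out)


# The 8-bit octets that need the extra care:
print(sorted(o for o in range(160, 256) if strips_to_meta(o)))
# [162, 168, 169, 188, 190, 220]
```

Octet 188 appears in the list because 188 with the 8th bit cleared is
60, the "<" character.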
6.3. X-Header-CharSet: Header
In the beginning (1981 :-), mail messages were in a universal
character set, ASCII-7. In the near future, they may again be in a
universal character set (AUC or something like it). Today, messages
are in a number of character sets, all sent in octets without any
character set identification.
The X-Header-CharSet header field is to be used to identify the
character set used in the header when not ASCII-7 or AUC. In the
absence of an X-Header-CharSet header field, the default character
set is defined to be ASCII-7 or AUC.
MTAs MUST NOT interpret the X-Header-CharSet field, or attempt to
convert from one set to another; the MTA's responsibility is to pass
the bits unmunged. A gateway may, of course, perform whatever
transformation is required into the "foreign" environment. (It
should also be noted that private point-to-point arrangements between
consenting MTAs are outside the scope of this, or indeed any,
standard.)
The character set selected must be ASCII-7 "conformant", that is, it
must assign substantially all of the ASCII-7 characters to the same
code points, to permit keywords and other important elements to be
represented.
IS 2022 methods are conformant, with the shift (back) to ASCII-7
wherever needed. ECMA-35 sets are borderline since they replace
ASCII-7 code points, but should usually not be a problem. (For
example, the German variant replaces the "@" character; a very strict
interpretation might try to make Internet addresses un-representable.
This is, of course, silly.)
The content of the X-Header-CharSet field is a single token in
ASCII-7. Appendix A is a partial list of values for the
X-Header-CharSet field.
The field has an X- name because it is a temporary expedient: as soon
as the International Standard AUC is defined, use of any other set in
headers is (to be) deprecated.
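
A user agent (not an MTA, which must not interpret the field) might
look up the declared header character set as in the Python sketch
below. It assumes the header has already been split into unfolded
lines of octets; the function name header_charset is hypothetical,
and None stands for the default of ASCII-7 or AUC.

```python
def header_charset(header_lines):
    """Return the X-Header-CharSet token, or None when the field is absent."""
    for line in header_lines:
        name, colon, body = line.partition(b":")
        # Field names are matched without regard to case, as in RFC 822.
        if colon and name.strip().lower() == b"x-header-charset":
            return body.strip().decode("ascii")  # the token is ASCII-7
    return None  # default: ASCII-7 or AUC


headers = [
    b"Subject: Gr\xfc\xdfe",
    b"X-Header-CharSet: ISO_8859-1",
]
print(header_charset(headers))  # ISO_8859-1
```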
7. Character Set Encodings
The content of the message may be encoded by a number of different
character sets (as explained above, character sets are defined to be
encodings of the IS 10646 UCS code point space).
Encoding: Text
is defined to be text in AUC.
Encoding: <charset> Text
is an encoding (transform) of UCS to the code point assignments
defined by <charset>.
It is perhaps amusing to note that it is not necessary to have the
UCS code points defined to have a complete definition of an encoding
such as ISO_8859-1 Text.
For example, messages in the Japanese Kanji encoding in common use
now should use:
Encoding: ISO_2022 (JIS Kanji) Text
The (optional) comment is useful, to explain to users not being
presented the actual glyphs, that the incomprehensible text following
is really Kanji. The comment MUST NOT be interpreted by any program.
(Particularly in the case of ISO_2022, where the code sets are
selected by the standard escape sequences).
This extends, of course, to multiple parts and other nested encodings
as described by RFC 1154.
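
The two forms of the Encoding field shown above could be read as in
the following Python sketch. It handles only a single-part field
body, ignores parenthesized comments (nested comments are not
handled), and treats the keyword "Text" without regard to case, which
is an assumption. The function name text_charset is hypothetical; the
multi-part syntax of RFC 1154 is out of scope here.

```python
import re

# A parenthesized comment with no nested parentheses.
COMMENT = re.compile(r"\([^()]*\)")


def text_charset(field_body):
    """Return the character set named by a single-part Encoding field body."""
    tokens = COMMENT.sub(" ", field_body).split()
    if not tokens or tokens[-1].lower() != "text":
        raise ValueError("not a single-part Text encoding")
    if len(tokens) == 1:
        return "AUC"      # "Encoding: Text" is defined to be text in AUC
    if len(tokens) == 2:
        return tokens[0]  # "Encoding: <charset> Text"
    raise ValueError("unexpected tokens in Encoding field")


print(text_charset("ISO_2022 (JIS Kanji) Text"))  # ISO_2022
```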
A. List of Character set Identifiers
The following is a list of character set identifiers that may be used
as the value of the header field X-Header-CharSet: and in the
Encoding: header. All of these keywords are added to the defined set
for the Encoding: header.
ISO_8859-1 IS 8859/1, commonly called Latin-1.
ISO_8859-2 IS 8859/2, Latin-2
ISO_8859-3 IS 8859/3, Latin-3
ISO_8859-4 IS 8859/4, Latin-4
ISO_8859-5 IS 8859/5, Cyrillic
ISO_8859-6 IS 8859/6, Arabic
ISO_8859-7 IS 8859/7, Greek
ISO_8859-8 IS 8859/8, Hebrew
ISO_8859-9 IS 8859/9, Latin-5
ECMA_35-DK ECMA replaced-code point set for (e.g.) Denmark.
Others in the same pattern: ECMA_35-NO, etc.
ISO_2022 ISO 2022
IBM_CP437 I.B.M. code page 437, for example; others follow the same
pattern. It is probably not a good idea to use the
EBCDIC based pages (e.g. 274, 500, etc.)
APPLE_MAC Apple Computer's Macintosh character set
Others... This list makes no claim to be complete at this
draft.
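
For readers who want to experiment, some of the identifiers above can
be mapped onto codecs in the Python standard library, as in this
sketch. The mapping is not part of this memo: the choice of
"mac_roman" for APPLE_MAC is an assumption, and the ECMA_35 variants
and ISO_2022 are omitted because they have no single corresponding
codec. The names PYTHON_CODECS and decode_text are hypothetical.

```python
# Identifier from the list above -> Python standard library codec name.
PYTHON_CODECS = {"ISO_8859-%d" % n: "iso8859_%d" % n for n in range(1, 10)}
PYTHON_CODECS["IBM_CP437"] = "cp437"
PYTHON_CODECS["APPLE_MAC"] = "mac_roman"  # ASSUMPTION: Macintosh Roman set


def decode_text(octets, charset):
    """Decode octets declared to be in one of the listed character sets."""
    return octets.decode(PYTHON_CODECS[charset])


# The same character, o-diaeresis, arrives as different octets:
print(decode_text(b"\xf6", "ISO_8859-1") == decode_text(b"\x94", "IBM_CP437"))
# True
```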
References
[1] David H. Crocker.
Standard for the Format of ARPA Internet Text Messages.
RFC 822, University of Delaware, August, 1982.
[2]
European Computer Manufacturers Association standard 35.
[citation tba with proper title and author].
[3] International Organization for Standardization.
Information technology -- Universal Coded Character Set (UCS).
ISO/IEC DIS 10646, ISO, November, 1990.
(Draft, defeated in June 1991 ballot).
[4] Jon Postel.
Simple Mail Transfer Protocol.
RFC 821, USC Information Sciences Institute, August, 1982.
[5] David Robinson, Robert L. Ullmann.
Encoding Header Field for Internet Messages.
RFC 1154, Prime Computer, April, 1990.
[6] Robert L. Ullmann.
SMTP on X.25.
RFC 1090, Prime Computer, February, 1989.
Authors' Addresses
David Robinson 10-30
Prime Computer, Inc.
500 Old Connecticut Path
Framingham, MA 01701
USA
Phone: +1 508 620 2800 x1774
Email: DRB@Relay.Prime.COM
Robert Ullmann 10-30
Prime Computer, Inc.
500 Old Connecticut Path
Framingham, MA 01701
USA
Phone: +1 508 620 2800 x1736
Email: Ariel@Relay.Prime.COM