Excerpts from Keith Moore's message Tue, 27 Apr 1993 21:57:48 -0400:
> We can't prevent people from using 8-bit characters as login names. However,
> if they use that login name for an email address, they won't be able to
> receive mail at all from many sites around the world. My perception is that
> this practice is not nearly as widespread as that of sending 8-bit body parts.
Some mail systems for PC LANs will gladly accept non-ASCII
characters for user names and mailboxes, e.g. Microsoft Mail for
Macintosh. In non-English-speaking countries such capabilities
will be used.
> That's not to say that we cannot recommend a way of encoding addresses as
> ASCII characters. (No, the RFC 1342 encoding is NOT appropriate.)
The RFC 1342 _approach_ is appropriate, though: define a new
interpretation of legal but seldom-used character sequences in
addresses as _representing_ non-ASCII characters.
+ Old mail software can handle addresses chosen to represent
non-ASCII characters without problems. Such addresses will,
however, be cryptic to human users.
+ New mail software will display such addresses using the
intended non-ASCII characters. Human users will be able to
read and, in important cases, remember such addresses with
the same ease as pure ASCII addresses.
An additional advantage:
+ This representation can be used as a standard for translating
addresses between RFC 822 mail and other mail protocols.
Excerpt from Keith Moore's message Tue, 27 Apr 1993 23:01:12 -0400:
> 1342 style encodings aren't appropriate because the local part of an address
> needs to be (a) unique and (b) opaque.
A third requirement is that addresses should be
(c) as short as possible.
Here is my proposal for an address encoding (which is applicable
to both local-parts and (sub-)domains):
1) The encoding system is based on _one_ coded character set,
chosen to be able to encode all characters of any local coded
character set, and _one_ method for encoding octets. The coded
character set is ISO 10646 at implementation level 3 (in both
two-octet and four-octet form). The octet encoding is BASE64.
(By the way, the final text of ISO 10646 is now available as
the 754-page document ISO/IEC JTC1/SC2 N2420.)
2) "*", "&" and "'" are chosen as representation switching
characters.
3) The initial representation is ASCII.
4) "*" switches to _two-octet representation_, if followed by a
BASE64 character. Following characters are interpreted as
the BASE64 representation of one or more two-octet 10646
characters. No BASE64 fill characters are used in this
representation.
5) "&" switches to _four-octet representation_, if followed by a
BASE64 character. Following characters are interpreted as
the BASE64 representation of one or more four-octet 10646
characters. No BASE64 fill characters are used in this
representation.
6) "'" switches to ASCII representation.
7) Quoted-string parts of addresses are not interpreted
according to the two-octet or four-octet representations.
8) The fact that two different octet sequences represent the
same character sequence according to ISO 10646 does _not_
imply that the corresponding addresses are equivalent. When
deciding e.g. the local-part for a mailbox, a decision
between the possible representations must be made.
9) If more appropriate methods are not available, it is
recommended that received characters that cannot be displayed
be shown as the sequence of ASCII characters that encodes
them.
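Rules 3 to 6 can be sketched as a small encoder. The Python below is only an illustration under stated assumptions, not part of the proposal: the function names are made up, every run of non-ASCII characters gets its own switch, the two-octet form is used whenever all characters of a run fit in it, and a literal "*", "&" or "'" in the input is rejected, since the rules above do not say how such characters would be written.

```python
import base64

SWITCHES = "*&'"  # representation switching characters (rule 2)


def b64_nofill(octets: bytes) -> str:
    """BASE64 without trailing "=" fill characters (rules 4 and 5)."""
    return base64.b64encode(octets).decode("ascii").rstrip("=")


def encode_address_part(text: str) -> str:
    """Encode one local-part or (sub-)domain; the name is hypothetical."""
    pieces = []
    run = []  # pending non-ASCII characters

    def flush(more_follows: bool) -> None:
        if not run:
            return
        if all(ord(c) <= 0xFFFF for c in run):
            switch, width = "*", 2   # two-octet representation (rule 4)
        else:
            switch, width = "&", 4   # four-octet representation (rule 5)
        octets = b"".join(ord(c).to_bytes(width, "big") for c in run)
        pieces.append(switch + b64_nofill(octets))
        if more_follows:
            pieces.append("'")       # back to ASCII (rule 6)
        run.clear()

    for ch in text:
        if ord(ch) < 128:
            if ch in SWITCHES:
                # Assumption: literal switch characters are not supported.
                raise ValueError("literal switch character: " + repr(ch))
            flush(more_follows=True)
            pieces.append(ch)        # initial representation is ASCII (rule 3)
        else:
            run.append(ch)
    flush(more_follows=False)
    return "".join(pieces)


print(encode_address_part("olle_j\u00e4rnefors"))  # olle_j*AOQ'rnefors
print(encode_address_part("\u043f\u0443\u0448\u043a\u0438\u043d"))  # *BD8EQwRIBDoEOAQ9
```

Under a literal reading of rule 4, a with diaeresis (octets 00 E4) becomes AOQ. Rule 8 shows up here as well: the same letter written as "a" followed by a combining diaeresis (octets 03 08) yields a different string, and so a different address.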
Example 1: The name
olle_j<a with diaeresis>rnefors
could be encoded as
olle_j*AOQ'rnefors
(AOQ is the BASE64 form, without fill characters, of the two
octets 00 E4.)
Example 2: The name
pushkin
written with Cyrillic letters is represented in the two-octet
form of 10646 by the octet sequence
04 3F 04 43 04 48 04 3A 04 38 04 3D
It could be encoded as
*BD8EQwRIBDoEOAQ9
(Names containing no ASCII characters will be more than twice as
long as if an 8-bit coded character set could be used.)
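A decoder for rules 3 to 6 can be sketched in Python as follows. It is a minimal illustration, not a definitive implementation: the function name is hypothetical, and it assumes that any character outside the BASE64 alphabet ends a two-octet or four-octet run, with a "'" at that position being consumed as the switch back to ASCII. Rule 7 (quoted strings) is not handled.

```python
import base64
import string

B64_CHARS = set(string.ascii_letters + string.digits + "+/")


def decode_address_part(encoded: str) -> str:
    """Decode one local-part or (sub-)domain; the name is hypothetical."""
    out = []
    i, n = 0, len(encoded)
    while i < n:
        ch = encoded[i]
        # "*" and "&" only switch if followed by a BASE64 character.
        starts_run = ch in "*&" and i + 1 < n and encoded[i + 1] in B64_CHARS
        if not starts_run:
            out.append(ch)           # ASCII representation
            i += 1
            continue
        width = 2 if ch == "*" else 4
        j = i + 1
        while j < n and encoded[j] in B64_CHARS:
            j += 1
        run = encoded[i + 1:j]
        # Restore the fill characters that the encoding leaves out.
        octets = base64.b64decode(run + "=" * (-len(run) % 4))
        if len(octets) % width:
            raise ValueError("run is not a whole number of characters")
        for k in range(0, len(octets), width):
            out.append(chr(int.from_bytes(octets[k:k + width], "big")))
        # Assumption: a "'" directly after the run is the switch to ASCII.
        i = j + 1 if j < n and encoded[j] == "'" else j
    return "".join(out)


print(decode_address_part("*BD8EQwRIBDoEOAQ9") == "\u043f\u0443\u0448\u043a\u0438\u043d")  # True
```

The string *BD8EQwRIBDoEOAQ9 is the switch character followed by the BASE64 form of the twelve octets 04 3F 04 43 04 48 04 3A 04 38 04 3D. It is 17 characters long, against 6 characters in an 8-bit coded character set, a factor of about 2.8.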
All this presupposes case-sensitivity. If addresses have to be
case-insensitive, a simple way out would be to represent an
uppercase BASE64 letter by prefixing it with "_" and to represent
the lowercase letter by not prefixing it. (This would lengthen
the encoded form of non-ASCII characters by roughly a further
40 %, since 26 of the 64 BASE64 characters are uppercase
letters.)
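The "_" convention can be sketched in Python like this. The function names are hypothetical, and the sketch assumes the marking is applied only to the BASE64 run itself, not to the ASCII parts of the address or to the switch characters.

```python
def fold_case_marks(run: str) -> str:
    """Write each uppercase BASE64 letter as "_" plus its lowercase form."""
    return "".join("_" + c.lower() if c.isupper() else c for c in run)


def unfold_case_marks(folded: str) -> str:
    """Recover the BASE64 run, whatever case the transport left it in."""
    out = []
    upper_next = False
    for c in folded.lower():
        if c == "_":
            upper_next = True        # the next letter was uppercase
        else:
            out.append(c.upper() if upper_next else c)
            upper_next = False
    return "".join(out)


run = "BD8EQwRIBDoEOAQ9"
folded = fold_case_marks(run)
print(folded)                                  # _b_d8_e_qw_r_i_b_do_e_o_a_q9
print(unfold_case_marks(folded.upper()) == run)  # True

# 26 of the 64 BASE64 characters are uppercase letters, so a run
# grows by 26/64 (about 41 %) on average.
print(26 / 64)                                 # 0.40625
```

The overhead for a particular name can be well above the average: in this run 12 of the 16 BASE64 characters are uppercase, so it grows from 16 to 28 characters.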
--
Olle Jarnefors, Royal Institute of Technology, Stockholm
<ojarnef(_at_)admin(_dot_)kth(_dot_)se>