Re: Comment on the draft MIME Part 1 document

Excerpts from Keith Moore's message Tue, 27 Apr 1993 21:57:48 -0400:

We can't prevent people from using 8-bit characters as login names.  However,
if they use that login name for an email address, they won't be able to
receive mail at all from many sites around the world.  My perception is that
this practice is not nearly as widespread as that of sending 8-bit body parts.


Some mail systems for PC LANs will gladly accept non-ASCII
characters for user names and mailboxes, e.g. Microsoft Mail for
Macintosh.  In non-English-speaking countries such capabilities
will be used.

That's not to say that we cannot recommend a way of encoding addresses as
ASCII characters.  (No, the RFC 1342 encoding is NOT appropriate.)


The RFC 1342 _approach_ is appropriate, though: define a new
interpretation of legal but seldom-used character sequences in
addresses as _representing_ non-ASCII characters.

+  Old mail software can handle addresses chosen to represent
   non-ASCII characters with no problems.  Such addresses will,
   however, look cryptic to human users.

+  New mail software will display such addresses using the
   intended non-ASCII characters.  Human users will be able to
   read and, in important cases, remember such addresses with
   the same ease as pure ASCII addresses.

An additional advantage:

+  This representation can be used as a standard for translating
   addresses between RFC 822 mail and other mail protocols.

Excerpt from Keith Moore's message Tue, 27 Apr 1993 23:01:12 -0400:

1342 style encodings aren't appropriate because the local part of an address
needs to be (a) unique and (b) opaque.


A third requirement is that addresses should be
(c) as short as possible.

Here is my proposal for an address encoding (which is applicable
to both local-parts and (sub-)domains):

1) The encoding system is based on _one_ coded character set,
   chosen to be able to encode all characters of any local coded
   character set, and one octet encoding method.  The basic
   character set is ISO 10646 at implementation level 3 (in both
   two-octet and four-octet form).  The encoding is BASE64.

   (By the way, the final text of ISO 10646 is now available in
   the 754-page document ISO/IEC JTC1/SC2 N2420.)

2) "*", "&" and "'" are chosen as representation switching
   characters.

3) The initial representation is ASCII.

4) "*" switches to _two-octet representation_, if followed by a
   BASE64 character.  Following characters are interpreted as
   the BASE64 representation of one or more two-octet 10646
   characters.  No BASE64 fill characters are used in this
   representation.

5) "&" switches to _four-octet representation_, if followed by a
   BASE64 character.  Following characters are interpreted as
   the BASE64 representation of one or more four-octet 10646
   characters.  No BASE64 fill characters are used in this
   representation.

6) "'" switches to ASCII representation.

7) Quoted-string parts of addresses are not interpreted
   according to the two-octet or four-octet representations.

8) The fact that two different octet sequences represent the
   same character sequence according to ISO 10646 does _not_
   imply that the corresponding addresses are equivalent.  When
   choosing, e.g., the local-part for a mailbox, one of the
   possible representations must be selected.

9) If no more appropriate method is available, it is
   recommended that received characters that cannot be displayed
   be shown as the sequence of ASCII characters that encodes
   them.
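
As a rough illustration, rules 1-6 can be sketched in Python.
The names encode_component and decode_component are hypothetical.
The sketch makes some assumptions that are not part of the
proposal: it handles a single local-part or (sub-)domain, it
rejects input that itself contains a switching character, it
prefers the two-octet form whenever all characters fit in it, and
it ignores quoted-strings (rule 7).

```python
import base64
from itertools import groupby

B64_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
)
SWITCH_WIDTH = {"*": 2, "&": 4}  # rules 4 and 5: octets per character


def encode_component(text):
    """Encode one local-part or (sub-)domain (hypothetical helper)."""
    if any(c in "*&'" for c in text):
        # Assumption: literal switching characters are out of scope here.
        raise ValueError("input contains a switching character")
    out = []
    runs = groupby(text, key=lambda c: ord(c) < 128)
    for index, (is_ascii, chars) in enumerate(runs):
        run = "".join(chars)
        if is_ascii:
            # Rule 6: "'" returns to ASCII after a BASE64 run.
            out.append(("'" if index > 0 else "") + run)
            continue
        # Use the two-octet form when every character fits in it.
        width = 2 if all(ord(c) <= 0xFFFF for c in run) else 4
        data = b"".join(ord(c).to_bytes(width, "big") for c in run)
        # Rules 4 and 5: BASE64 without fill ("=") characters.
        b64 = base64.b64encode(data).decode("ascii").rstrip("=")
        out.append(("*" if width == 2 else "&") + b64)
    return "".join(out)


def decode_component(encoded):
    """Decode one component back to a character string."""
    out, i, n = [], 0, len(encoded)
    while i < n:
        c = encoded[i]
        starts_run = (
            c in SWITCH_WIDTH and i + 1 < n and encoded[i + 1] in B64_CHARS
        )
        if not starts_run:
            out.append(c)  # rule 3: plain ASCII
            i += 1
            continue
        width = SWITCH_WIDTH[c]
        j = i + 1
        while j < n and encoded[j] in B64_CHARS:
            j += 1
        run = encoded[i + 1:j]
        # Restore the fill characters that the representation omits.
        data = base64.b64decode(run + "=" * (-len(run) % 4))
        if len(data) % width:
            raise ValueError("run is not a whole number of characters")
        for k in range(0, len(data), width):
            out.append(chr(int.from_bytes(data[k:k + width], "big")))
        i = j
        if i < n:
            if encoded[i] == "'":
                i += 1  # rule 6: back to ASCII
            elif encoded[i] not in SWITCH_WIDTH:
                raise ValueError("BASE64 run not terminated by a switch")
    return "".join(out)


name = "olle_j\u00e4rnefors"
encoded = encode_component(name)
print(encoded, decode_component(encoded) == name)
```

A stricter decoder might also reject BASE64 runs whose unused
trailing bits are not zero; this sketch does not check that.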

Example 1: The name

   olle_j<a with diaeresis>rnefors

could be encoded as

   olle_j*AOQ'rnefors

Example 2: The name

   pushkin

written with Cyrillic letters is represented in the two-octet
form of 10646 by the octet sequence

   04 3F 04 43 04 48 04 3A 04 38 04 3D

and could be encoded as

   *BD8EQwRIBDoEOAQ9

(Names containing no ASCII characters will be more than twice as
long as if an 8-bit coded character set could be used.)
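
The length claim can be checked with a little arithmetic.  The
following Python sketch (encoded_length is a hypothetical helper,
not part of the proposal) counts one switching character plus the
unpadded BASE64 characters needed for a name consisting only of
non-ASCII characters.

```python
import math


def encoded_length(n_chars, octets_per_char=2):
    """Switch character plus unpadded BASE64 for n_chars characters."""
    bits = 8 * octets_per_char * n_chars
    return 1 + math.ceil(bits / 6)  # each BASE64 character carries 6 bits


for n in (1, 6, 20):
    length = encoded_length(n)
    print(n, length, round(length / n, 2))
```

For long names the ratio approaches 16/6, i.e. about 2.67
characters per original character in the two-octet form.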

All this presupposes case-sensitivity.  If addresses have to be
case-insensitive, a simple way out would be to represent an
uppercase BASE64 letter by prefixing it with "_" and represent
the lowercase letter by not prefixing it.  (On average this would
lengthen the encoding of non-ASCII characters by a further 40 %
or so, since 26 of the 64 BASE64 characters are uppercase.)
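
A minimal Python sketch of this case-insensitive variant follows.
The names fold_b64 and unfold_b64 are hypothetical, and the sample
run is arbitrary.  The sketch assumes that a mail system may change
the case of any letter in transit, so the decoder relies only on
the "_" prefix.

```python
def fold_b64(run):
    """Prefix each uppercase BASE64 letter with "_" and lowercase it."""
    return "".join("_" + c.lower() if c.isupper() else c for c in run)


def unfold_b64(folded):
    """Recover the original BASE64 run, whatever case arrived."""
    out, i = [], 0
    while i < len(folded):
        if folded[i] == "_":
            if i + 1 >= len(folded) or not folded[i + 1].isalpha():
                raise ValueError('"_" must be followed by a letter')
            out.append(folded[i + 1].upper())  # prefixed: uppercase
            i += 2
        else:
            out.append(folded[i].lower())  # unprefixed: lowercase
            i += 1
    return "".join(out)


sample = "QwRI"  # arbitrary BASE64 run, for illustration only
folded = fold_b64(sample)
print(folded, unfold_b64(folded.upper()) == sample)

# Expected extra length if all 64 BASE64 characters are equally likely.
print(26 / 64)
```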

--
Olle Jarnefors, Royal Institute of Technology, Stockholm 
<ojarnef(_at_)admin(_dot_)kth(_dot_)se>