Non-ASCII Internet addresses? (Was: Comment on the draft MIME Part 1 document)

The RFC 1342 _approach_ is appropriate, though: define a new
interpretation of legal but seldom-used character sequences in
addresses as _representing_ non-ASCII characters.

+  Old mail software can handle addresses chosen to represent
   non-ASCII characters with no problems.  They will however be
   cryptic to human users.

+  New mail software will display such addresses using the
   intended non-ASCII characters.  Human users will be able to
   read and, in important cases, remember such addresses with
   the same ease as pure ASCII addresses.


I don't think this is a good idea.

It's important that the "displayed" form of an address be identical to
the way you spell it when you type it in. ...


I agree with this but I don't see how it can be an argument
against my proposal.  There is no reason why mail software,
implementing the rules for representing non-ASCII characters,
can't be fully symmetric in the sense that all non-ASCII
characters that can be displayed can also be input by normal
keyboard methods.  On my Swedish keyboard there is a key for the
letter A WITH DIAERESIS.  To write an address containing this
letter (by means of my proposed encoding) I will press this key.
From the point of view of a user of a new mail program the
displayed address will be identical to the address he enters from
the keyboard.  In both cases he sees the non-ASCII characters.

... I need to be able to give
the address to a friend on paper or a business card.  What happens if
his system doesn't support the same character set mine does?


On my business card my email address would be given in two
forms, one containing non-ASCII characters, the other being the
real sequence of ASCII characters that form the RFC 822 address.
I think this is no different from other possible solutions to the
"non-ASCII characters in addresses" problem.

(a) Uniqueness - There should not be several different ways to spell an
electronic mail address.  Likewise, a "mapped" address must identify at
most one mailbox.  Ideally, two addresses can be easily compared to see
if they identify the same mailbox.


In my proposal there is only one address for each mailbox, the
RFC 822 address which is a sequence of ASCII characters.  It can
however be _viewed_ in two different well-defined ways, as a
readable meaningful sequence of characters from the wide
repertoire of ISO 10646 or as the real sequence of ASCII
characters.  (There is a small problem with different possible
10646 representations of the same character.  It can be solved
by additional rules for how to use 10646 in the context of mail
addresses.)
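
One shape such an additional rule could take, sketched in Python: bring
every name to a single canonical spelling before it is encoded or
compared.  The sketch borrows Unicode's NFC normalization from Python's
standard unicodedata module as a stand-in.  NFC is an assumption of this
sketch, not a rule taken from the proposal, and same_mailbox_name is a
hypothetical helper.

```python
import unicodedata

def canonical(name: str) -> str:
    """Return one fixed spelling for a name (assumption: NFC is the chosen rule)."""
    return unicodedata.normalize("NFC", name)

def same_mailbox_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two displayed names after canonicalisation."""
    return canonical(a) == canonical(b)

precomposed = "\u00c4ngstr\u00f6m"     # A WITH DIAERESIS as a single character
decomposed = "A\u0308ngstro\u0308m"    # base letter followed by COMBINING DIAERESIS

print(precomposed == decomposed)                    # raw comparison of the two spellings
print(same_mailbox_name(precomposed, decomposed))   # comparison after canonicalisation
```

With a rule of this kind, both spellings would encode to the same
sequence of ASCII characters, so the one-address-per-mailbox property
is kept.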

(b) Opacity - The format of the local-part of an address is specific to
the mail domain.  Mail handling software should avoid making
assumptions about this format or applying any transformations to it.  A
user name encoding scheme should not change this rule. ...


I don't think this is a problem with my proposal either.  It
doesn't restrict the range of possible addresses.  It doesn't
use "%" or "." or "!" for encoding purposes.

(c) Restricted character set - The mapped address should fit within a
sufficiently small character set that it need not be encoded again, for
example, using the techniques defined in RFC 1137.  Furthermore, it
should survive translation into the addresses used by other message
handling systems such as X.400(84).  (Basically this restricts to the
PrintableString character set)


I didn't intend to solve the problem of reducing the characters
allowed in Internet addresses to the primitive set usable in
X.400(84) addresses, viz. PrintableString characters:

   A-Z a-z 0-9 '()+,-./:=?

If that really is a requirement, the four special encoding characters
of my proposal:            *&'_
could be changed to e.g.:  =/'+
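
Whether a set of encoding characters survives the PrintableString
restriction can be checked mechanically.  The Python sketch below uses
the repertoire listed above plus SPACE, which PrintableString also
contains; offending_chars is a hypothetical helper name.

```python
import string

# PrintableString repertoire as listed above, plus SPACE.
PRINTABLE_STRING = set(string.ascii_letters + string.digits + " '()+,-./:=?")

def offending_chars(text: str) -> list:
    """Return the characters of `text` that fall outside PrintableString."""
    return sorted(set(text) - PRINTABLE_STRING)

print(offending_chars("*&'_"))   # the four original special characters
print(offending_chars("=/'+"))   # the suggested replacements
```

The same check applies to the Base64 alphabet itself (letters, digits,
"+", "/" and "="), all of which lie inside PrintableString.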

(d) Terse - Email addresses should be easy to type without errors.


I think my encoding gives acceptably short addresses.  By
complicating the rules somewhat it can be made even more
efficient.  Almost all addresses will use at most one
script in addition to the Latin script.  Most alphabetical
scripts fall within one row in ISO 10646, i.e. the first octet of
the two-octet form is the same.  A new prefix representation
could be defined to make it possible for example to encode the
Russian form of "pushkin", in two-octet 10646

   04 3F 04 43 04 48 04 3A 04 38 04 3D

to

   =04P0NIOjg9

instead of

   *BD8EQwRIBDoEOAQ9
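
A rough Python sketch of the two forms, for experimenting with the
octets listed above.  The exact rules are not spelled out in this
message, so several things here are assumptions: plain Base64 over the
octets, "*" as the marker of the long form, "=" followed by the row
octet in hexadecimal as the marker of the short form, and UTF-16BE as a
stand-in for the two-octet form of 10646.  The function names are
hypothetical.

```python
import base64

def two_octet_form(text: str) -> bytes:
    """Two-octet form of `text`; UTF-16BE stands in for it (assumption)."""
    return text.encode("utf-16-be")

def encode_full(text: str) -> str:
    """Assumed long form: '*' followed by Base64 of every octet."""
    return "*" + base64.b64encode(two_octet_form(text)).decode("ascii")

def encode_row_prefixed(text: str) -> str:
    """Assumed short form: '=' + row octet in hex + Base64 of the second octets."""
    octets = two_octet_form(text)
    rows, cells = octets[0::2], octets[1::2]
    if len(set(rows)) != 1:
        raise ValueError("characters are not all in the same row")
    return "=" + format(rows[0], "02X") + base64.b64encode(cells).decode("ascii")

pushkin = "\u043f\u0443\u0448\u043a\u0438\u043d"   # the six Cyrillic letters
print(two_octet_form(pushkin).hex(" ").upper())    # octets of the two-octet form
print(encode_row_prefixed(pushkin))                # short form, row sent once
print(encode_full(pushkin))                        # long form, row repeated
```

The saving comes from sending the row octet once instead of once per
character, so it grows with the length of the name.  Names whose octet
count is not a multiple of three would also pick up Base64 padding,
which this sketch leaves in place.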

Of course typing errors are more likely with these address parts
than with normal address parts with the same number of ASCII
characters (which often are pronounceable).  After a transitional
period this will not be a problem within a country using a
language other than English.  In Russia (for this example) most users
will be able to input the address for "pushkin" by Cyrillic
letters on their Cyrillic keyboards and will not have to deal
with the basic ASCII address at all.  For international mail
exchange the ASCII form of the address is still usable, but
people with more than sporadic international correspondence will
probably want to have a second email address containing their
name in a form transcribed to English.  (That address will of
course not have to be algorithmically computable from the
Cyrillic address.)

(e) Obviousness - Ideally, the mapping to ASCII should be "obvious",
i.e., easily guessable by a human who knows the recipient's user name.


This wish is not fulfilled at all by my proposal.  As the
previous discussion on this list about mnemonic encodings (such
as Keld Simonsen's, defined in RFC 1345) has shown, it _is_
possible to design at least guessable representations of Latin
letters with many different diacritical marks using the
invariant part of ISO 646-compatible 7-bit character sets.
But:

a) Even these "European" mnemonic codes are not obvious in most
   cases.

b) They use several characters not allowed in atoms: ",.:;<>

c) ISO 10646 covers many other scripts than the Latin script.
   For these it's impossible (ideographic characters) or very
   difficult (alphabetical scripts) to attain any mnemonic value
   at all to an ordinary user.

d) Not even Keld's system, the most ambitious of all mnemonic
   representations, has yet been extended to cover the whole
   repertoire of ISO 10646.

e) There will be considerable maintenance problems with any
   mnemonic scheme as new scripts and characters are added to
   ISO 10646.

f) The customary _transcription_ of e.g. Cyrillic letters to
   ASCII depends on both the source language and the target
   language.  As an example, different rules are used when
   transcribing _from_ Russian or Serbo-Croatian, and different
   rules are used when transcribing _to_ English ("Pushkin") or
   Swedish ("Pusjkin").  Furthermore, these transcription rules
   are seldom reversible and often very complicated to define in
   syntactic terms, if that is at all possible.

g) _Transliteration_ systems can be designed that are
   reversible, but they will not be "easily guessable" by
   ordinary users, let alone "obvious".  And the complexity of
   the transliteration rules needed to cover all scripts served
   by ISO 10646 will be enormous.

h) A mnemonic encoding will on average give longer addresses
   than my encoding.  Even in Keld's system codes longer than
   two ASCII characters are needed for most 10646 characters.

i) I admit that it's nice if the representation for non-ASCII
   characters can be made so simple that addresses can be
   encoded and decoded "by hand", but I think this is much less
   important than the other requirements.  We should educate our
   computers to perform such mechanical tasks. 8-)
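
Point b) in the list above can be checked mechanically.  In RFC 822 an
atom consists of characters other than the specials, SPACE and control
characters; the Python sketch below applies that rule.  The helper
names are hypothetical.

```python
# RFC 822: an atom is one or more characters other than specials, SPACE and CTLs.
SPECIALS = set('()<>@,;:\\".[]')

def is_atom_char(ch: str) -> bool:
    """True if `ch` may appear in an unquoted RFC 822 atom."""
    code = ord(ch)
    return 32 < code < 127 and ch not in SPECIALS

def rejected(chars: str) -> str:
    """Return the characters of `chars` that cannot appear in an atom."""
    return "".join(ch for ch in chars if not is_atom_char(ch))

print(rejected('",.:;<>'))   # characters named in point b)
print(rejected("*&'_"))      # special characters of the proposed encoding
```

Every character named in point b) is rejected, while the four special
characters of the proposed encoding pass.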

It's fairly easy to see that no mapping of any large character set to
ASCII will meet all of these requirements, but it might be possible for
some character sets and languages, (say by translating "=F6" to "#oe#").


But what about the Danish-Norwegian O WITH STROKE, (ISO 8859-1:
F8)?  The customary transliteration of that character will also
give "#oe#".  And then there is LIGATURE OE, which exists as 7A
in ISO 6937 and CCITT T.61, used in X.400(88) ...
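
The collision is easy to show in a few lines of Python.  The table
below is illustrative only: it records the transliterations mentioned
in this discussion, with the entry for LIGATURE OE added on the
assumption that it would be treated the same way.  ambiguous_codes is a
hypothetical helper.

```python
from collections import defaultdict

# Illustrative table, not a standard: transliterations discussed in the text.
TO_ASCII = {
    "\u00f6": "#oe#",   # o with diaeresis (ISO 8859-1: F6)
    "\u00f8": "#oe#",   # o with stroke (ISO 8859-1: F8)
    "\u0153": "#oe#",   # ligature oe (assumed to map the same way)
}

def ambiguous_codes(table: dict) -> dict:
    """Return every ASCII code that more than one character maps to."""
    reverse = defaultdict(list)
    for char, code in table.items():
        reverse[code].append(char)
    return {code: chars for code, chars in reverse.items() if len(chars) > 1}

print(ambiguous_codes(TO_ASCII))
```

Any code that appears in the result cannot be decoded back to a single
character, so a "mapped" address using it no longer identifies one
spelling.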

... Any proposal that severely penalizes
someone who uses his correctly-spelled name as a login id (say, by
encoding it in base64) will not meet wide acceptance...we want to make
his life more pleasant, not less.


With my proposal the life of this user _will_ be more pleasant,
since his email address, as he sees it, can be equal to his real
name.  It's true that it will be a BASE64'ed alphabet soup
on the fundamental level but _he_ will not see it, doesn't have to
know it, doesn't have to care.  If he sends mail to an American
friend who has to use old-fashioned mail software, _that_ person
will see an address in the From: field that is uglier than
usual.  She shouldn't have to type that address herself,
however.  She can use the reply command and possibly "cut and
paste" the address into her address book.

--
Olle Jarnefors, Royal Institute of Technology, Stockholm 
<ojarnef(_at_)admin(_dot_)kth(_dot_)se>
