Re: Non-ASCII Internet addresses?

1) I wrote myself (Thu, 29 Apr 93 21:41:45 +0200):

It's important that the "displayed" form of an address be identical to
the way you spell it when you type it in. ...


I agree with this but I don't see how it can be an argument
against my proposal.  There is no reason why mail software,
implementing the rules for representing non-ASCII characters,
can't be fully symmetric in the sense that all non-ASCII
characters that can be displayed can also be input by normal
keyboard methods.  On my Swedish keyboard there is a key for the
letter A WITH DIAERESIS.  To write an address containing this
letter (by means of my proposed encoding) I will press this key.
From the point of view of a user of a new mail program the
displayed address will be identical to the address he enters by
the keyboard.  In both cases he sees the non-ASCII characters.


I would like to clarify a point here:  Even in a system capable
of displaying all characters of ISO 10646, for most users the UA
should restrict the characters displayed in addresses to a subset
that the user is familiar with, can recognise without difficulty
and knows how to input.  Most users in Sweden, for example, can
handle in this sense a subset containing ASCII, the Swedish
national letters and all Latin letters composed with an acute,
grave or circumflex accent, a tilde or a diaeresis.  But a
Russian address containing Cyrillic letters, say, will be
difficult for an ordinary Swedish user, and such addresses should
be displayed by the UA in their fundamental ASCII form, even if
the hardware and software have the capability to handle Cyrillic
letters.
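
To make the display rule above concrete, here is a minimal Python
sketch.  The function name, the "familiar" set and the way the two
forms of the address are passed in are assumptions made for the
illustration only; nothing here is part of the proposal itself, and
the decoding from the ASCII form is taken as already done.

```python
def display_form(ascii_form, decoded_form, familiar):
    """Pick which form of an address a UA should show to the user.

    ascii_form   -- the address as it travels on the wire (pure ASCII)
    decoded_form -- the same address with non-ASCII characters decoded
    familiar     -- set of non-ASCII characters this user can read and type
                    (a hypothetical per-user or per-locale setting)
    """
    for ch in decoded_form:
        # ASCII is always acceptable; anything else must be in the user's set.
        if ord(ch) >= 0x80 and ch not in familiar:
            return ascii_form
    return decoded_form


# Hypothetical setting for a Swedish user: only the national letters.
swedish = set("\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6")
```

With this setting an address containing A WITH DIAERESIS is shown
decoded, while one containing Cyrillic letters falls back to whatever
ASCII form was received.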

2) One thing that I'm uncertain about is whether we can trust the 
whole Internet email system to preserve the case of ASCII letters 
in addresses.  RFC 822 requires this for the local-part:

     local-part  =  word *("." word)             ; uninterpreted
                                                 ; case-preserved


For the right-hand side of an address, RFC 1034 seems to 
at least anticipate a future extension requiring case 
preservation:

By convention, domain names can be stored with arbitrary case, but
domain name comparisons for all present domain functions are done in a
case-insensitive manner, assuming an ASCII character set, and a high
order zero bit.  This means that you are free to create a node with
label "A" or a node with label "a", but not both as brothers; you could
refer to either using "a" or "A".  When you receive a domain name or
label, you should preserve its case.  The rationale for this choice is
that we may someday need to add full binary domain names for new
services; existing services would not be changed.
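
The two rules quoted above can be put side by side in a small Python
sketch.  It assumes a plain "local-part@domain" address, and it
treats the local-part as compared exactly, which is a conservative
reading of "uninterpreted, case-preserved" rather than something
RFC 822 states; the function names are made up for the example.

```python
def ascii_lower(s):
    """Fold only the ASCII letters A-Z to lower case; leave all else alone."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def same_address(a, b):
    """Compare two addresses without changing the case of either.

    The domain is compared case-insensitively (ASCII only), as RFC 1034
    describes; the local-part is compared exactly (an assumption).
    """
    local_a, _, domain_a = a.rpartition("@")
    local_b, _, domain_b = b.rpartition("@")
    return local_a == local_b and ascii_lower(domain_a) == ascii_lower(domain_b)
```

Neither function rewrites its input, so the case in which an address
was received is what gets stored and passed on.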


3) For my encoding scheme to work it's important not only that
all characters outside ASCII can be encoded but also that they
can be encoded in only one way.  I have said earlier that all
characters on implementation level 3 of ISO 10646 should be
encodable.  This is possible but presupposes rather complex
rules to achieve unique representations.  A simpler way is to
only support level 2 of 10646.  In that case these rules would
be sufficient:

a) For ASCII characters, use ASCII representation.

b) For sequences of at least 3 non-ASCII characters from plane 00
   of 10646, use prefix representation.

c) For shorter sequences of non-ASCII characters from plane 00 of
   10646, use two-octet representation.

d) For other (sequences of) 10646 characters, use four-octet
   representation.
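
As a rough illustration of how rules a) to d) select exactly one
representation for each part of an address, here is a minimal Python
sketch.  It only decides which rule applies to each run of
characters; the octet formats of the prefix, two-octet and four-octet
representations are not produced.  Taking plane 00 to mean code
positions 0080 to FFFF, and the function names, are assumptions of
the sketch, and it does not check the level 2 restrictions on
combining characters.

```python
from itertools import groupby


def char_class(ch):
    """Classify one character as ASCII, non-ASCII plane 00, or other."""
    cp = ord(ch)
    if cp < 0x80:
        return "ascii"
    if cp <= 0xFFFF:          # assumed extent of plane 00
        return "plane00"
    return "other"


def choose_representations(address):
    """Split an address into maximal runs and name the rule for each run."""
    result = []
    for cls, group in groupby(address, key=char_class):
        run = "".join(group)
        if cls == "ascii":
            rep = "ascii"                                      # rule a)
        elif cls == "plane00":
            rep = "prefix" if len(run) >= 3 else "two-octet"   # rules b), c)
        else:
            rep = "four-octet"                                 # rule d)
        result.append((run, rep))
    return result
```

Because every run falls under exactly one rule, two programs that
follow the rules should arrive at the same choice of representations
for the same address, which is the uniqueness property asked for
above.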

--
Olle Jarnefors, Royal Institute of Technology, Stockholm 
<ojarnef(_at_)admin(_dot_)kth(_dot_)se>