Re: Will case of non-ascii character be preserved by IDNA?


Jacob Palme <jpalme(_at_)dsv(_dot_)su(_dot_)se> wrote:

Will case of non-ascii character be preserved by IDNA?


Simon gave a perfect explanation of what will happen in practice.

Characters (ASCII and non-ASCII) are case-folded into lowercase by
Nameprep, iff the string contains _any_ non-ASCII character.


That is indeed what will happen if the sender applies ToASCII, which is
almost certainly what all existing implementations of IDNA do.
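
As a rough illustration (not part of the IDNA spec), Python's standard
encodings.idna module provides nameprep and ToASCII functions that
follow these algorithms.  The helper name to_ascii_domain is made up for
this sketch, and it assumes labels are separated only by the ASCII
full stop:

```python
from encodings import idna


def to_ascii_domain(name: str) -> str:
    """Apply ToASCII to each label of a dotted name (hypothetical helper).

    Assumption: labels are separated by "." only; other dot characters
    that IDNA also recognizes are ignored here.
    """
    labels = name.split(".")
    return ".".join(idna.ToASCII(label).decode("ascii") for label in labels)


# The label containing a non-ASCII letter goes through Nameprep and is
# folded to lowercase; the pure-ASCII label "Net" is left untouched.
print(to_ascii_domain("München.Net"))  # xn--mnchen-3ya.Net

# Nameprep on its own folds both ASCII and non-ASCII letters.
print(idna.nameprep("MÜNCHEN"))  # münchen
```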

In theory, a conformant IDNA implementation need not apply ToASCII,
but could instead apply some other operation that returns equivalent
results.  The requirement in the IDNA spec is:

    Whenever a domain name is put into an IDN-unaware domain name slot,
    it MUST contain only ASCII characters.  Given an internationalized
    domain name (IDN), an equivalent domain name satisfying this
    requirement can be obtained by applying the ToASCII operation to
    each label...

In other words, ToASCII is sufficient, but is a little stricter than
necessary.  For example, given the IDN München.Net, a sender that simply
uses ToASCII will convert it to xn--mnchen-3ya.Net, but it would also
be permissible to send XN--MNCHEN-3YA.Net or Xn--MnChEn-3Ya.NeT or
xn--Mnchen-3ya.Net.  That last possibility is interesting because when
the receiver applies ToUnicode to it, the result will be München.Net.
Thus it would be trivial for senders to preserve the case of ASCII
letters without relying on any special cooperation from receivers.
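
A minimal sketch of such a sender, assuming Python's encodings.idna
module for ToASCII and ToUnicode.  The function name
to_ascii_keep_ascii_case is hypothetical, and the approach (copying the
original ASCII letters back over the basic code points of the ACE
label) is just one way it might be done; it falls back to the plain
ToASCII result whenever Nameprep changed the ASCII part in any way
other than case:

```python
from encodings import idna

ACE_PREFIX = "xn--"


def to_ascii_keep_ascii_case(label: str) -> str:
    """ToASCII, but keep the case of the ASCII letters (hypothetical)."""
    ace = idna.ToASCII(label).decode("ascii")
    if label.isascii():
        return ace  # ToASCII leaves pure-ASCII labels as they are

    # After the prefix, Punycode output is <basic code points>-<deltas>.
    basic, sep, deltas = ace[len(ACE_PREFIX):].rpartition("-")
    original_ascii = "".join(ch for ch in label if ord(ch) < 128)

    # Only substitute when the two differ by case alone; otherwise
    # Nameprep did something else and the plain result is the safe choice.
    if not sep or original_ascii.lower() != basic:
        return ace
    return ACE_PREFIX + original_ascii + sep + deltas


sent = to_ascii_keep_ascii_case("München")
print(sent)                  # xn--Mnchen-3ya
print(idna.ToUnicode(sent))  # München
```

The receiver here is the ordinary ToUnicode; nothing on that side needs
to change.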

Non-ASCII letters are another story.  Although the underlying Punycode
encoding is capable of carrying case information (the mixed-case
annotation described in the Punycode spec), IDNA makes no use of
that capability (IDNA is complex enough without it).  For a receiver
that uses ToUnicode to display IDNs (as all existing implementations
surely do), there is no way to make it output uppercase Ü.

In theory, receivers need not use ToUnicode, but could instead use some
other operation that returns equivalent results.  The requirement in the
IDNA spec is:

    ACE labels obtained from domain name slots SHOULD be hidden from
    users...  Given an internationalized domain name, an equivalent
    domain name containing no ACE labels can be obtained by applying the
    ToUnicode operation to each label.

Therefore, it is conceivable that someday a pair of operations
newToASCII and newToUnicode could be implemented that return results
equivalent to those of ToASCII and ToUnicode, but which work together to
preserve the case of non-ASCII letters.  For example, given the label
MÜNCHEN, newToASCII might return xN--MNCHEN-3yA, which a receiver using
newToUnicode would display as MÜNCHEN, and which a receiver using the
old ToUnicode would display as MüNCHEN.  All of these are equivalent
(IDNs are case-insensitive).  Working out the details of newToASCII and
newToUnicode would be non-trivial.

But it's not clear to what extent case preservation is demanded, or
even expected.  I've noticed that email addresses often get coerced to
all-caps, even the local part (in blatant disregard of the standards,
which say that local parts can be case-sensitive).

AMC