Re: IDN (was Did anyone tell Microsoft yet?)


In <200205091908(_dot_)09943(_at_)sendmail(_dot_)mutz(_dot_)com> Marc Mutz 
<mutz(_at_)kde(_dot_)org> writes:

On Wednesday 08 May 2002 13:34, Charles Lindsey wrote:


Yes, I think on further reflection that you would not insist on too much
(or even any) canonicalization of the RFC 2047 stuff, but you would
require it to be decoded to an octet stream at the far end, and the local
part would be considered to match if the octet streams matched.

<snip>

But that breaks existing software that needs to compare local-parts for
equality (think mailing list handling software). This breakage can only be
avoided if the encoding has the property of always yielding the same output
octet-sequence for the same input Unicode sequence (modulo Unicode n11n and
c13n).


Yes, I think you have to accept that agents that expect to process I18N
local-parts would have to have RFC 2047 decoding built into them as part
of their comparison process if this particular method is to be used.
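
To make the comparison concrete, here is a minimal Python sketch of "decode to
an octet stream, then compare". The helper names (local_part_octets,
same_local_part) are made up for illustration, and it assumes the whole
local-part is a single RFC 2047 encoded-word; it is not taken from any
existing agent.

```python
import base64
import quopri
import re

# One RFC 2047 encoded-word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def local_part_octets(local_part: str) -> bytes:
    """Return the octet stream a local-part stands for (hypothetical helper)."""
    match = ENCODED_WORD.fullmatch(local_part)
    if match is None:
        # Not an encoded-word: the literal ASCII octets are the value.
        return local_part.encode("ascii")
    _charset, encoding, text = match.groups()
    if encoding in "bB":
        return base64.b64decode(text)
    # "Q" encoding: a quoted-printable variant in which "_" means space.
    return quopri.decodestring(text.encode("ascii"), header=True)


def same_local_part(a: str, b: str) -> bool:
    """Two local-parts match if their decoded octet streams match."""
    return local_part_octets(a) == local_part_octets(b)
```

With this, =?iso-8859-1?q?j=FCrgen?= and =?iso-8859-1?b?avxyZ2Vu?= compare
equal even though the encoded text differs. Note that the charset label is
ignored, so the same name encoded in two different charsets would not match
under a pure octet comparison.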

<snip (1)>


2. The result of applying IDNA to some valid local-part is another
local-part. Who is to tell whether that was not the intended local-part in
the first place? So you would first need to restrict RFC 2822 local-parts
in some way.

But this is a problem that arises for every encoding that is representable in
current local-part syntax. RFC 2047 also yields just another valid
local-part[1] (although less likely to conflict with an existing local-part due
to the many "broken" mailers that would display it decoded).


Yes, although the RFC 2047 syntax is bizarre enough that if you see
anything of the form =?...?...?...?= you can pretty safely assume it was
not intended to be taken literally (and lots of software around will
immediately decode it whether it should or not).

So that leads us to the inevitable conclusion that we are FORCED to change
the syntax of local-part in RFC 2822 before we can make progress on this.
Two possibilities:

1. Restrict it by excluding some characters (which are then available to
indicate that some encoding has taken place). The simplest restriction
would be to declare
        local-part = token / quoted-string
which frees up '?' and '='. So either you then use RFC 2047 encoding, or
you use IDNA with perhaps '??' in the Nameprep. So our friend jürgen
becomes ??--jrgen-kva .

2. Allow some extra characters, which would only be allowed for encoded
local-parts. Trouble is they would have to be chosen from the 'specials',
and there are no obvious candidates.

So I think it has to be (1), which then leaves the choice of RFC 2047 or
IDNA open for further discussion (though the problem of uppercase letters
in IDNA would still need resolution).
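
As a rough illustration of option (1) with the IDNA-style encoding, the
following Python sketch uses the standard "punycode" codec and the '??--'
marker from the jürgen example. The function names are hypothetical, and the
sketch deliberately skips Nameprep and the uppercase question, so it shows
only the marker-plus-encoding idea.

```python
# '??--' is the marker from the example above, not a registered prefix.
# It cannot occur in an unencoded local-part once '?' is excluded.
PREFIX = "??--"


def encode_local_part(name: str) -> str:
    """Encode a non-ASCII local-part; leave plain ASCII ones untouched."""
    if name.isascii():
        return name
    return PREFIX + name.encode("punycode").decode("ascii")


def decode_local_part(local_part: str) -> str:
    """Reverse encode_local_part; unmarked local-parts are taken literally."""
    if local_part.startswith(PREFIX):
        return local_part[len(PREFIX):].encode("ascii").decode("punycode")
    return local_part


print(encode_local_part("jürgen"))         # ??--jrgen-kva
print(decode_local_part("??--jrgen-kva"))  # jürgen
```

Because the marker contains '?', a receiver can tell an encoded local-part
from a literal one without guessing, which is the point of restricting the
syntax first.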

So we are then back to the issue of whether it is safe to allow all and
sundry applications to indulge in Unicode normalization and lowercasing.
People seem determined to argue that we need this. If that is so, then I
still think it better to let the operating system do it. That would mean
having a relocatable library containing the Unicode tables permanently
built into or supplied with each operating system. There might be some
future in that if the Unicode people could publish it (complete with a
unicode.h file to establish its structure). Do we know whether they have
plans to do that?  I do know that they publish various tables, but only in
txt form AIUI.
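
For what it is worth, here is a small Python sketch of what "normalization and
lowercasing" amounts to when the tables come from a shared library. Python's
standard unicodedata module stands in for such a library here; the choice of
NFKC and the name comparison_key are assumptions for illustration, not
something settled in this thread.

```python
import unicodedata


def comparison_key(name: str) -> str:
    """Normalize, then lowercase, so equivalent spellings compare equal."""
    return unicodedata.normalize("NFKC", name).lower()


precomposed = "j\u00fcrgen"   # u-umlaut as a single code point
decomposed = "JU\u0308RGEN"   # capital U followed by a combining diaeresis

same = comparison_key(precomposed) == comparison_key(decomposed)

# The version of the Unicode tables this library was built with.
tables_version = unicodedata.unidata_version
```

Every application linked against the same library gets the same answer, but
two systems carrying different table versions could still disagree on newly
assigned characters, which is part of the worry about letting all and sundry
do this.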

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5