ietf-822

Re: Dual names, IDN and ASCII, in e-mail addresses?

2003-10-08 05:57:52

Adam M. Costello wrote:
Bruce Lilly <blilly(_at_)verizon(_dot_)net> wrote:

Such a tag should not be separate from the IDN; it should be part of
the IDN so that it travels with and is processed with the text which
is presented to a human (using a client with appropriate IDN support).
One wouldn't want the language tag to be stripped or otherwise mangled
in transit.


But the "transit" for domain names is often the telephone, billboards,
etc.  It's hard to see how invisible language tags would survive such
transit.

In the same way that URLs with non-ASCII characters do (see RFC 2396
section 1.5): via a compatible encoding which provides for unambiguous
rendering.

I see no reason why some text sequence in different languages should
not encode to different DNS names, just as "boot" in German and
"boot" in English refer to two very different things (indeed, there
are differences between en-US and en-GB) -- in fact it seems highly
desirable that they *should* encode to different DNS names.


So josé.com and josé.com should be two different domain names?  (One is
Spanish and the other is Portuguese.)

Technically, they would be presentation forms of two distinct domain
names. Domain names use characters from a *subset* of US-ASCII consisting
of letters, digits, and the hyphen character (LDH). [I suspect that you
know that, but your use of "ASCII domain names" suggests that you may have
temporarily forgotten it: '$' is a perfectly valid ASCII character, but it
cannot appear as part of a domain name.] "josé.com" has never been, is not
now, and is unlikely to become a domain name, because one of its
dot-separated components does not meet the requirements for a subdomain:
it does not consist solely of LDH characters.
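
To make the LDH rule concrete, here is a minimal sketch of a check in Python.
The function names are made up for illustration, and the additional rule that
a label may not begin or end with a hyphen is the traditional hostname
restriction, which is an assumption beyond what is stated above.

```python
import re

# One label under the LDH rule: letters, digits, hyphen; 1 to 63 characters.
# The "no leading or trailing hyphen" lookarounds reflect the traditional
# hostname restriction (an assumption added for this sketch).
LDH_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_ldh_label(label: str) -> bool:
    """Return True if a single dot-separated component is a valid LDH label."""
    return LDH_LABEL.fullmatch(label) is not None


def is_ldh_name(name: str) -> bool:
    """Return True if every component of the name is an LDH label."""
    labels = name.rstrip(".").split(".")
    # 255 is the overall length limit quoted in this thread.
    return len(name) <= 255 and all(is_ldh_label(label) for label in labels)


print(is_ldh_name("example.com"))      # all components are LDH
print(is_ldh_name("jos\u00e9.com"))    # U+00E9 is outside LDH
print(is_ldh_name("pay$.com"))         # '$' is ASCII but not LDH
```

Under this check "josé.com" and "pay$.com" are both rejected, for the same
reason: each has a component containing a character outside LDH.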

Domain names do not always mimic proper names; often a domain name refers
to subject matter. As noted, the same sequence of characters often means
very different things in different languages, and it is often spoken
differently even when the meaning is unchanged, which is why language
tagging is essential for presentation via text-to-speech.

If I see josé.com on paper, how
do I know which of those two domains it is?  And even if I know, how do
I type the language tag into my browser?

One will appear as something like qz-jos-blurfl and the other as something
like qz-jos-grimble. And either can be typed, even in a browser on a PDA or
similar device which has no accented character input capability.  Depending
on exactly how the tagging is implemented [it should be clear that I don't
care much for Unicode 3.1 / RFC 2482] one might be printable as josé(es).com
and the other as josé(pt-br).com, for example (probably not ideal syntax, but
hopefully you get the idea).  However, in that case, printing or typing may
raise issues (identification, normalization, I/O, etc.).  For non-ephemeral
presentation, one would be well-advised to use the unambiguous on-the-wire
form.

This is why I think language tagging, if used at all, would need to be
non-essential markup, which could be retained when feasible, but could
also be lost with no worse result than a degradation in the quality of
presentation, not a failure to find the domain.

Something like qz-pt-br--jos-farkle, i.e. placing the language tag outside
of the nameprep+punycode process, might be feasible, but discarding the tag
discards information.
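
A rough sketch of that layout, to show what is lost when the tag is dropped.
Everything about the format here is hypothetical: the "qz-" prefix and the
"--" separator simply follow the made-up examples in this message, the
helper names are invented, and the nameprep step is skipped entirely. Only
the Punycode step uses a real facility (Python's built-in "punycode" codec).

```python
PREFIX = "qz-"  # hypothetical ACE prefix, as in the examples in this thread


def encode_tagged(label: str, lang: str) -> str:
    """Hypothetical layout: prefix, language tag, '--', Punycode of the label.

    Nameprep is omitted; a real scheme would normalize the label first.
    """
    ace = label.encode("punycode").decode("ascii")
    return f"{PREFIX}{lang}--{ace}"


def strip_tag(wire: str) -> str:
    """Discard the language tag, keeping the prefix and the Punycode part."""
    body = wire[len(PREFIX):]
    _lang, _sep, ace = body.partition("--")
    return PREFIX + ace


es = encode_tagged("jos\u00e9", "es")
pt = encode_tagged("jos\u00e9", "pt-br")

print(es, pt)                        # two distinct LDH labels
print(strip_tag(es), strip_tag(pt))  # identical once the tag is gone
```

With the tag in place the two labels differ; once it is stripped they
collapse to the same string, so software that drops the tag can no longer
tell the two names apart.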

'Presentation' and 'find[ing] the domain' are distinct operations. Some
form of presentation might be able to get away with hiding some information
which may be essential for other forms of presentation.  However, discarding
such information is incompatible with use in a protocol. Software encountering
"qz-jos-blurfl" as a domain name had better use that when passing the name to
DNS, regardless of what is used for non-protocol (i.e. human) presentation.

Domain name components consist of characters from LDH (37 elements: 26
case-insensitive letters, 10 digits, and the hyphen) and can be up to 63
characters long.  That's 37^63 distinct components of length 63.  There are
also 37^62 possible components of length 62, etc.  Total domain name length
can be up to 255 (LDH) characters.  Clearly there is no shortage of domain
names, so the motivation for IDN is not a need to broaden the character set
beyond LDH.  The impetus is to be able to provide for presentation of text
(in the sense used in RFC 2277) to humans; text in some language.
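
For a sense of scale, the counts can be computed exactly with Python's
arbitrary-precision integers. This is only a back-of-the-envelope sketch:
it treats every LDH string as a usable component, so it is an upper bound
(restrictions such as no leading or trailing hyphen would shrink it a
little). The variable names are arbitrary.

```python
ALPHABET = 37   # 26 case-insensitive letters + 10 digits + hyphen
MAX_LABEL = 63  # maximum length of one component

# Components of exactly 63 characters.
exact_63 = ALPHABET ** MAX_LABEL

# Components of any length from 1 to 63 (a geometric series).
up_to_63 = sum(ALPHABET ** n for n in range(1, MAX_LABEL + 1))

print(f"37^63 has {len(str(exact_63))} decimal digits")
print(f"components of length 1..63: about {up_to_63:.3e}")
```

Even the single-component count is a number on the order of 10^98, which is
the point: the LDH name space is nowhere near exhausted.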