ietf-822

Re: Dual names, IDN and ASCII, in e-mail addresses?

2003-10-08 15:01:58

Bruce Lilly <blilly(_at_)verizon(_dot_)net> wrote:

But the "transit" for domain names is often the telephone,
billboards, etc.  It's hard to see how invisible language tags would
survive such transit.

The same way URLs with non-ASCII characters do so (see RFC 2396
section 1.5);

Section 1.5 makes it quite clear that URIs don't contain non-ASCII
characters.  URIs are composed only of ASCII characters.  There is a
recommended way for those characters to be constructed (in a reversible
way) from arbitrary octets, and a recommended way for those octets to
be constructed (in a reversible way) from non-ASCII text.  But URIs
themselves are ASCII, and there is no non-ASCII presentation form; the
user always sees ASCII.  The reversible constructions I just mentioned
may be useful for machines (as in long URIs generated and consumed by
HTML forms), but they're not helpful for making URIs easier to remember
and type for humans who are unfamiliar with ASCII characters.  That is
why people are working on IRIs, which can actually contain non-ASCII
characters.
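The two reversible steps described above (text to octets, octets to ASCII characters) can be sketched as follows. This is only an illustration: it assumes UTF-8 as the text-to-octet mapping, which RFC 2396 does not mandate, and the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

text = "jos\u00e9"               # non-ASCII text: j, o, s, e-acute
octets = text.encode("utf-8")    # step 1: text -> octets (UTF-8 is an assumption)
uri_segment = quote(octets)      # step 2: octets -> ASCII characters (%XX escapes)

# Both steps are reversible: unquote() undoes the escaping and, by default,
# decodes the resulting octets as UTF-8.
recovered = unquote(uri_segment)

print(uri_segment)   # what the user sees is still ASCII
print(recovered == text)
```

The escaped form round-trips for machines, but it is no easier for a human to remember or type than any other ASCII string.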

So josé.com and josé.com should be two different domain names?  (One
is Spanish and the other is Portuguese.)

Technically, they would be presentation forms of two distinct domain
names;

The impetus [for IDN] is to be able to provide for presentation of
text (in the sense used in RFC 2277) to humans; text in some language.

The view of IDNA and its designers is that internationalized domain
names are not intended to be mere presentation forms.  They are meant to
be an extension of the concept of domain name; something exactly like
a domain name, except that it can truly contain non-ASCII characters.
JOSÉ, josé, XN--JOS-DMA, and xn--jos-dma are equally real and legitimate
forms of the same internationalized label; they are all first-class
citizens of the IDN namespace.  Existing protocols (like DNS, SMTP,
message headers) use only the ASCII subset of the IDN namespace, but
that is sufficient to include all the labels, because of the equivalence
relation.  New protocols are welcome to use the non-ASCII forms as
protocol elements.  Applications are encouraged to use the non-ASCII
forms as user-interface elements for both input and output.
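As a rough illustration of that equivalence relation, here is a sketch using Python's built-in "idna" codec, which implements the IDNA ToASCII and ToUnicode operations.  The helper name to_ascii is hypothetical, and the final lowercasing is my addition: the codec passes all-ASCII labels such as "COM" through unchanged, and ASCII labels compare case-insensitively anyway.

```python
def to_ascii(name: str) -> str:
    """Map every label of an internationalized name to its ASCII form."""
    # The "idna" codec applies nameprep (which case-folds non-ASCII labels)
    # and Punycode, then adds the "xn--" prefix.  Lowercasing the result
    # normalizes labels that were already ASCII (an assumption of this sketch).
    return name.encode("idna").decode("ascii").lower()

lower = to_ascii("jos\u00e9.com")    # e-acute, lowercase
upper = to_ascii("JOS\u00c9.COM")    # E-acute, uppercase
ace = to_ascii("XN--JOS-DMA.com")    # already in ASCII form

# The ASCII form maps back to the non-ASCII form for display.
display = b"xn--jos-dma.com".decode("idna")

print(lower, upper, ace, display)
```

All three inputs end up as the same ASCII name, which is why a protocol restricted to the ASCII subset can still carry every label.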

"josé.com" has never been and is not now (and is unlikely to become) a
domain name because one of its dot-separated components does not meet
the requirements for a subdomain;

Sorry, I was using "domain name" as shorthand for "internationalized
domain name" and using "ASCII domain name" to refer to the traditional
concept of "domain name".  That's confusing during the transition, but I
think that's going to become the common usage.

domain names use characters from a *subset* of US-ASCII which is
comprised of letters, digits, and the hyphen character (LDH) [I
suspect that you know that, but your use of "ASCII domain names"
indicates that you may have temporarily forgotten that -- '$' is a
perfectly valid ASCII character, but it cannot appear as part of a
domain name].

Domain names can contain any ASCII characters.  The DNS spec defines a
"preferred syntax" that is encouraged but not required.  STD-3 defines a
syntax for host names.

The DNS spec talks about how to put dots inside labels, and other
standards-track RFCs use labels containing underscores and slashes for
domain names that are not host names.
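To make the distinction concrete, here is a minimal sketch of a check for the host name (letter-digit-hyphen) syntax.  The function name is hypothetical and the sample names are my own illustrations of the underscore and slash cases (SRV-style names as in RFC 2782, classless in-addr.arpa delegation as in RFC 2317); names that fail the check can still be perfectly good domain names.

```python
import re

# One host-name label: 1 to 63 letters, digits, or hyphens, with no hyphen
# at either end.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_host_name(name: str) -> bool:
    """Return True if every dot-separated label fits the host name syntax."""
    return all(_LDH_LABEL.fullmatch(label) is not None
               for label in name.split("."))

samples = [
    "example.com",                 # ordinary host name
    "xn--jos-dma.com",             # ASCII form of an internationalized name
    "_sip._tcp.example.com",       # underscores: domain name, not host name
    "0/25.2.0.192.in-addr.arpa",   # slash: domain name, not host name
]
for s in samples:
    print(s, is_host_name(s))
```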

If I see josé.com on paper, how do I know which of those two domains
it is?  And even if I know, how do I type the language tag into my
browser?

One will appear as something like qz-jos-blurfl and the other as
something like qz-jos-grimble.  And either can be typed, even in a
browser on a PDA or similar device which has no accented character
input capability.

The whole point of IDNs is to allow users to see and type familiar
characters, not ASCII garbage.  An application that supports IDNs and
accented characters should allow the user to type josé.com, not require
the user to type qz-jos-blurfl or qz-jos-grimble.  No one would want a
domain name that always had to be typed that way.

AMC