--On Friday, February 15, 2013 16:48 -0800 Joe Touch
<touch(_at_)isi(_dot_)edu> wrote:
> If any label were allowed, then why does IDN conversion go so
> far out of its way to exclude particular strings, e.g., those
> beginning/ending with '-' and encodes everything 0..7F into
> a-z/0-9?
> (I was focused on looking up A records given FQDNs)
Now you are asking a different question, although the answer for
A RRs generally is still the same. My apologies to those who
don't need the mini-tutorial that follows -- many of those of us
who work primarily in applications would need similar tutorials
to understand the reasons for some of the decisions in the work
of those who primarily operate at lower layers of the stack.
I recommend rereading the relevant sections of RFCs 1123 and
2181 but, briefly, the DNS doesn't impose any limitations other
than what will fit in an octet. However, many, perhaps most,
applications do impose their own rules and those rules usually
match what 1034/1035 call the "preferred syntax" -- a syntax
derived from popular applications at the time as those
specifications make clear. As one example with which I'm
painfully familiar, SMTP treats a domain name containing
characters outside the ASCII range as a syntax violation, and a
conforming implementation will never look up such a domain as
part of mail address resolution or routing. Consequently,
something like
    Non-ASCII-String  MX  0  some.domain.example.
is perfectly valid as far as the DNS is concerned but nonsense
as far as actual utility is concerned -- SMTP implementations
are the only users of MX RRs and no conforming SMTP
implementation will ever access such a record.
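To make the distinction concrete, here is a minimal Python sketch, not taken from any real resolver or mail system. The names encode_wire_label and is_preferred_syntax are hypothetical, invented for illustration. The first applies only the 63-octet label length limit; the second applies the letter-digit-hyphen "preferred syntax" (as relaxed by RFC 1123 to allow a leading digit) that an application such as SMTP would typically enforce.

```python
import re

# Hypothetical helper: pack one label into DNS wire format
# (a length octet followed by the label's octets). The only
# constraint applied is the label length limit of 1..63 octets.
def encode_wire_label(label: bytes) -> bytes:
    if not 1 <= len(label) <= 63:
        raise ValueError("label must be 1..63 octets")
    return bytes([len(label)]) + label

# Letters, digits, and hyphens; no leading or trailing hyphen.
_LDH = re.compile(rb"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")

# Hypothetical helper: application-level "preferred syntax" check.
def is_preferred_syntax(label: bytes) -> bool:
    return 1 <= len(label) <= 63 and _LDH.fullmatch(label) is not None

raw = "bücher".encode("utf-8")      # label containing non-ASCII octets
print(encode_wire_label(raw))       # the DNS wire format can carry it
print(is_preferred_syntax(raw))     # an SMTP-style check rejects it
```

The point of the sketch is only that the two checks live in different places: the first belongs to the DNS, the second to the application.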
IDNA is a clever trick (or, from other perspectives, an ugly
hack) that accomplishes two main things:
   -- It permits IDNs to be used with applications
      including, e.g., SMTP, without changing _their_ syntax
      rules because the labels stored in the DNS and
      transmitted on the wire still conform to those
      "preferred syntax" rules.

   -- It warns applications and forces the additional
      restrictions and processing that enable sensible
      treatment of non-ASCII strings. As the most obvious
      example, the case-insensitive matching that the DNS
      specifies for ASCII strings is not defined by the DNS
      for non-ASCII ones (and, indeed, becomes more
      complicated and language or locale-sensitive for some
      characters).
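Both points can be illustrated roughly with Python's built-in "idna" codec. Note the assumptions: that codec follows the older IDNA2003 rules (RFC 3490), not IDNA2008, and the name "Bücher.example" is just an illustrative input. This is a sketch of the idea, not a reference implementation.

```python
# Python's standard "idna" codec follows IDNA2003 (RFC 3490); it is
# used here only as a convenient illustration.
name = "Bücher.example"            # illustrative name, not from the thread

a_label_form = name.encode("idna")
print(a_label_form)                # ASCII-only form stored in the DNS

# The codec applies case mapping before encoding, so differently cased
# non-ASCII input yields the same ASCII-compatible form. The DNS itself
# never has to compare non-ASCII characters.
print("bücher.example".encode("idna") == a_label_form)

# Decoding recovers the case-mapped Unicode form.
print(a_label_form.decode("idna"))
```

The encoded result contains only letters, digits, and hyphens, which is why an unmodified SMTP implementation can look it up without changing its syntax rules.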
I don't believe the importance of the second was fully
appreciated when we got started on IDNA. To this day, people
who believe that IDNA can be replaced by simply placing Unicode
strings encoded in UTF-8 into the DNS tend to make proposals
that ignore those issues.
The additional exclusions of IDNA such as prohibition of most
symbols in the Unicode collection and restrictions on the
appearance of "--" in the third and fourth octets of labels
apply only to IDNA implementations and are intended to provide
an extended, but still relatively safe, version of the
historical "preferred syntax" and/or to protect the syntax for
signaling special coding in the (unlikely but not impossible)
event that a different one is needed for some purpose in the
future.
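As a small sketch of the "--" rule, the following Python function sorts ASCII labels by what their third and fourth characters signal. The function name classify_label and its three category strings are hypothetical, chosen for illustration; only the "xn--" prefix itself comes from the IDNA specifications.

```python
ACE_PREFIX = "xn--"   # prefix IDNA uses to signal an encoded label

# Hypothetical helper: classify an ASCII label by whether its third and
# fourth characters are "--", the position reserved for signaling
# special coding.
def classify_label(label: str) -> str:
    lowered = label.lower()
    if lowered.startswith(ACE_PREFIX):
        return "idna-encoded"
    if len(lowered) >= 4 and lowered[2:4] == "--":
        return "reserved"          # "--" in positions 3-4, other prefix
    return "ordinary"

for candidate in ("xn--bcher-kva", "ab--cd", "example", "a--b"):
    print(candidate, classify_label(candidate))
```

An IDNA implementation would refuse to register or look up the "reserved" case; the DNS itself, as noted, has no such rule.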
The bottom line is that none of these restrictions, including
the SMTP one and the IDNA ones, are a property or requirement of
the DNS.
best,
john