Re: IDN security violation? Please comment

 Date: 2005-02-08 19:57
 From: John C Klensin <john-ietf(_at_)jck(_dot_)com>

I'll try to respond to the issues and questions you raise, but
please note that the landscape here is strewn with dead horses
and that kicking them is not a particularly helpful or rewarding
activity.


Noted. Ditto for hand-wringing.

In both cases,
better (i.e., more difficult to detect) examples are possible,
especially with fewer or different constraints.  But that isn't
the point, is it?


My point was that case conversion doesn't go very far as a
means of detecting such things.

If
one is expecting an ASCII string, then seeing a punycode label
instead would be a strong tip-off that there is a problem.


I'm still not sure that I understand exactly what you mean by
"a punycode label"; is that the DNS label (comprised of LDH),
or some hexadecimal codes, or something else?  [LDH characters
are (a subset of) ASCII...]
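
To make the terminology concrete, here is a minimal sketch. It assumes
that "a punycode label" means the ASCII-compatible (ACE) form of an IDN
label, i.e. an LDH label carrying the "xn--" prefix defined by IDNA
(RFC 3490). The function names are hypothetical, the decoding uses
Python's built-in "punycode" codec, and no nameprep or validity checks
are attempted.

```python
import string

ACE_PREFIX = "xn--"  # IDNA ACE prefix (RFC 3490)
LDH = set(string.ascii_letters + string.digits + "-")


def is_ldh(label: str) -> bool:
    """True if the label uses only ASCII letters, digits, and hyphen."""
    return bool(label) and set(label) <= LDH


def is_ace_label(label: str) -> bool:
    """True if the label is LDH and carries the ACE prefix."""
    return is_ldh(label) and label.lower().startswith(ACE_PREFIX)


def decode_label(label: str) -> str:
    """Return the Unicode form of an ACE label; other labels pass through."""
    if not is_ace_label(label):
        return label
    # Strip the prefix and let the stdlib punycode codec do the decoding.
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


print(decode_label("xn--bcher-kva"))  # an ACE label, shown in Unicode form
print(decode_label("example"))        # a plain LDH label, unchanged
```

Under that reading, every ACE label is itself an LDH label, so a check
for "is this ASCII?" alone cannot tell the two apart; only the prefix
does.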

Let me try to say this carefully.  The "intent behind IDN" is to
permit people to use local languages and characters in what
appears to them to be DNS labels.


OK, I'll go around that minefield for the moment.

Until and unless every one of 
us has a keyboard that permits easy input of every Unicode
character (and I don't mean by knowing and typing in its code
point position) and the knowledge and character
perception/discrimination ability needed to use such a magical
keyboard [...]


A couple of observations:

1. I have in mind a keyboard on a certain device which has
   support for protocols which use domain names (HTTP, SMTP/
   Internet Message Format, VPIM).  It has a keyboard which
   is at best inconvenient for entry of ASCII text. Unicode
   "text" (see below for an explanation of the scare quotes)
   is unthinkable.  That device is a cell phone.  I have in
   mind another device with a keyboard (a PDA). It also has
   support for protocols which use domain names (all of the
   above plus VNC, FTP, TELNET, SSH, and probably a few
   others that I don't recall). The keyboard has no question
   mark key or escape key, and no convenient way to enter
   those characters short of menus etc. in specific
   applications.  Unicode, likewise, is unthinkable.  I am
   once again reminded of RFC 1958 (section 3.1); clearly
   somebody has lost sight of the issues discussed therein --
   huge Unicode equivalence/normalization/whatever tables
   simply won't fit in some devices.
2. Once upon a time, Unicode had Design Principles; I quote
   from Table 2.1 as it appeared in early Unicode Versions:
   "Sixteen-bit character codes | Unicode characters have a
   width of 16 bits."
   "Plain text | The Unicode Standard encodes plain text."
   The accompanying text went on: "Graphologies unrelated to
   text, such as musical and dance notations, are outside the
   scope of the Unicode Standard."  All of which sounded
   promising.  Well, those design principles have long been
   abandoned.  More recent versions of Unicode have added --
   you guessed it -- musical notations, etc.   Unicode
   adhering to the early design principles might have had a
   chance of fitting into small, low-power, mobile devices.
   But with expansion of the code points by several orders
   of magnitude that's impractical.  Not to mention the
   problems with incompatible versions (and I'm not referring
   to "the Korean mess" of RFCs 2279/2781).

if I am a sensible and cautious user of
Lower Slobbovian script and I'm sending an IDN or IRI on paper
to a user who is not familiar with that script, I'm going to
send the punycode or URI form along as a safety precaution.
YMMV, of course, and you might plausibly prefer to let only
people who know and can read and type your script get to your
content.


Or adhere to the design principles mentioned in RFC 2396
section 1.5.

I'd add that one approach to the problem would be to undo the
encoding, query DNS to get an IP address, then present that
(possibly with associated SOA information and reverse domain
name lookup); numeric IP addresses aren't going to be mistaken
for some random collection of "characters" (in the Unicode
sense) or non-numeric glyphs.
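
A rough sketch of that approach, assuming the application performs the
lookups itself. The name describe_host and its resolve/reverse
parameters are hypothetical; by default they map to Python's
socket.gethostbyname and socket.gethostbyaddr. The SOA query is left
out because the standard library has no call for it, and the function
expects the ASCII (ACE) form of the name as input.

```python
import socket


def describe_host(name,
                  resolve=socket.gethostbyname,
                  reverse=lambda ip: socket.gethostbyaddr(ip)[0]):
    """Decode an IDN, look up its address, and reverse-map that address.

    Returns a dict that an application could show in a status area.
    """
    # Undo the ACE encoding so the Unicode form can be shown alongside.
    unicode_name = name.encode("ascii").decode("idna")
    ip = resolve(name)      # forward lookup uses the ASCII form
    try:
        ptr = reverse(ip)   # reverse lookup may fail (DHCP pools, etc.)
    except OSError:
        ptr = None
    return {"ascii": name, "unicode": unicode_name,
            "ip": ip, "reverse": ptr}
```

With the default resolvers this performs real DNS queries; passing stub
callables for resolve and reverse allows the display logic to be
exercised without a network.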


In the discussion above, you made the observation that end users
are not likely to be good at decoding punycode-containing IDNs
on sight.   We agree.   Do you think those users are going to be
better at looking at an IP address and figuring out if it
belongs to whomever they think it belongs to?


No, which is why I mentioned SOA information (reverse lookup
of the IP to name mapping alone may not work in some cases
(DHCP, etc.) and won't help in others: "yah00.com" -> IP ->
"yah00.com" doesn't help much).  On the other hand, if SOA
information indicates that "yah00.com" is registered to
somebody in China, that's a big indication that something is
fishy.  Of course, registrars will need to be more vigilant
about ensuring that SOA information, whois records, etc. are
correct [and, yes, I am aware that some people intentionally
provide falsified information].

As you are
thinking about this, note that the world's most popular
operating system doesn't support a "dig" or "nslookup" function
in most of its versions/variations. [...]


The idea is that the application (e.g. browser) would do the
lookups and display (e.g. in a status area, or perhaps something
like the way some browsers display certificate/cookie information)
the relevant information.

However, you
should be aware already that many, perhaps most, domains (at any
level of the tree) have created and enforce rules about the names
they are willing to register

[...]

if you don't like the rules of one domain, you are free to [...]
make up your own rules for its subdomains.


And therein lies the gaping loophole in such schemes.

I would also like to take this opportunity to repeat an earlier
suggestion, viz. that the IAB should update RFC 1958 and give
that update some status more substantive than "Informational".
In particular, such an update should clearly state that
protocol elements are simply that; any resemblance to natural
language names, places, or things is purely coincidental.


Sure.  Who do you think would pay attention to such a statement?


Those who care about doing the right thing.  I believe that
there are developers in that category, but lacking a clear and
authoritative statement of principles, many are easily misled
by misinformation or simply assumptions made in the absence of
facts.

Now you have a valid point that in many respects it's too late
regarding this specific instance of this particular issue.  But
RFC 1958 covers a lot of ground, and is probably overdue for an
update and some reinforcement.  I am dismayed at the poor quality
of engineering behind some recent proposals, and failing some
clear up-to-date architectural guidelines, I suspect that matters
will get worse.

Like it or not, there is a large population in the real world
who are not interested in that argument or position.  There are
even folks who are technically sophisticated enough to
understand and accept your argument about protocol identifiers
who nonetheless believe that they should be able to identify
objects with names or acronyms that have mnemonic value in their
languages and character sets.


There's an old maxim: "be careful what you ask for; you might
get it".  As already noted, the sort of problem under discussion
was not only predictable, it was predicted as the inevitable
result of IDNs.  So those folks got what they wanted, and the
problems that go hand-in-hand with it.  Unfortunately, everybody
else also suffers from the problems.
