Re: IDN security violation? Please comment

Bruce,

I'll try to respond to the issues and questions you raise, but
please note that the landscape here is strewn with dead horses
and that kicking them is not a particularly helpful or rewarding
activity.

--On Tuesday, 08 February, 2005 14:54 -0500 Bruce Lilly
<blilly(_at_)erols(_dot_)com> wrote:

 Date: 2005-02-08 08:39
 From: John C Klensin <john-ietf(_at_)jck(_dot_)com>

Well, it is a little worse because there are tools that make
detection of the YAH00.COM problem and its relatives pretty
easy and those tools are widely understood.  For example,
forcing those domain names to lower case makes them very
distinguishable (yahoo.com and yah00.com are pretty clearly
different) and using fonts that make zeros and "o"s, ones and
"l"s, etc., clearly different helps a lot too.


On the other hand, using lower case won't help if the
"attacker" uses Greek omicron instead of Latin 'O'.
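The two points above can be checked with a few lines of Python. This is only a sketch: the helper name non_ascii_chars is made up for illustration, and the spoofed string is a constructed example, not one taken from a real registration.

```python
import unicodedata

def non_ascii_chars(domain):
    """Return (character, Unicode name) pairs for every non-ASCII character."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in domain if ord(ch) > 127]

# Lower-casing makes the digit zero stand out from the letter "o".
print("YAH00.COM".lower())

# Greek capital omicron (U+039F) lower-cases to Greek small omicron
# (U+03BF), which is still not the Latin letter "o".
spoof = "YAH\u039f\u039f.COM".lower()
print(spoof == "yahoo.com")      # the strings differ even if they look alike
print(non_ascii_chars(spoof))    # names the characters that are not ASCII
```

Listing the Unicode names is one way a tool could expose the difference that a font may hide.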


As I have said elsewhere, there are _many_ opportunities for
confusion here.  I assume that those who constructed this
particular example wanted to use a well-known phishing target.
The particular YAHOO example, as distinct from the paypal
example, came from a comment on the IETF list.  In both cases,
better (i.e., more difficult to detect) examples are possible,
especially with fewer or different constraints.  But that isn't
the point, is it?

With IDNs, the simple fact that there are tens of thousands of
characters with which one can try to create confusion, rather
than 37 or so, means there are going to be more
"opportunities". What is more important, perhaps, is that we
just don't have the experience with the design of user
interfaces that make problem detection easy.   For example,
the moment I touched the Firefox cursor to the examples at
http://www.shmoo.com/idn/, I realized that I really wanted to
see the punycode in the status line as well as the "native
character" rendering.


I assume that rather than "punycode" (which is an encoding
scheme used for *part* of IDNs) you mean the on-the-wire
dot-separated DNS name components consisting solely of
letters, digits, and hyphens. If so, I have two comments:
1. That's not likely to help, as humans aren't very adept at
   decoding IDNs on sight, and distinguishing one IDN from
   another on sight isn't something that one would expect
   casual users to be able to do; all IDNs tend to look like
   "xn--blah", and many casual users lack any of concern,
   interest, inclination, or patience to look beyond "xn".


I think I said that, although in different language :-(.   If
one is expecting an ASCII string, then seeing a punycode label
instead would be a strong tip-off that there is a problem.   If
one is expecting an IDN string, then seeing a punycode-label in
the string that is presented would be a far less useful hint.
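A rough sketch of the status-line idea, assuming Python's built-in "idna" codec (which implements the IDNA 2003 ToASCII operation) is an acceptable stand-in for what a browser would do. The function names and the bracketed display format are invented for illustration; no browser is claimed to work this way.

```python
ACE_PREFIX = "xn--"

def to_wire_form(domain):
    """Convert a presentation-form name to its letters-digits-hyphens form."""
    return domain.encode("idna").decode("ascii")

def ace_labels(domain):
    """Return the labels that had to be punycode-encoded."""
    return [label for label in to_wire_form(domain).split(".")
            if label.lower().startswith(ACE_PREFIX)]

def status_line(domain):
    """Show the native form, plus the wire form whenever the two differ."""
    wire = to_wire_form(domain)
    return domain if wire == domain else f"{domain}  [{wire}]"

print(status_line("yahoo.com"))             # plain ASCII, nothing extra shown
print(status_line("yah\u03bf\u03bf.com"))   # Greek omicrons force an xn-- label
```

For someone expecting an ASCII name, any non-empty result from ace_labels is the tip-off; for someone expecting an IDN, it says much less.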

2. That would defeat the intent behind IDN, which is to present
   what the on-the-wire DNS name represents rather than that
   on-the-wire DNS name.


Let me try to say this carefully.  The "intent behind IDN" is to
permit people to use local languages and characters in what
appears to them to be DNS labels.  Until and unless every one of
us has a keyboard that permits easy input of every Unicode
character (and I don't mean by knowing and typing in its code
point position) and the knowledge and character
perception/discrimination ability needed to use such a magical
keyboard, there will always be some likelihood of reversion to
punycode -- not for the local language and characters, but for
presentation-form FQDNs that contain characters from very far
away.  I hope that doesn't happen very often.  I expect that it
will happen less often as time goes on.  But I don't expect we
will reach zero, at least within the next decade or two. 

Whether or not you expect that, there is a huge difference
between seeing native-character text in the display of the DNS
name or URI/IRI on the web page --where I would hope to _never_
see punycode-- and what can optionally be turned on in a status
line.  But note that we are talking about user interface issues
here, not standards.  If a user wants that status line
information, let her have it.  If he doesn't, so be it.  And, if
a browser doesn't offer the needed flexibility to give users
what they want, I presume that users who care enough will find
other browsers.

As another piece of this, my own guess --and I want to stress
that it is just a guess, not a proposal for a standard or
requirement-- is that whatever mechanism is used to copy DNS
names or URIs from one place to another will acquire separate
"copy native characters" and "map to punycode and copy that"
options.  I'd expect similar options for IRIs, i.e., "copy IRI"
and "force into URI format with escaped characters and copy
that".  Why?  Because, if I am a sensible and cautious user of
Lower Slobbovian script and I'm sending an IDN or IRI on paper
to a user who is not familiar with that script, I'm going to
send the punycode or URI form along as a safety precaution.
YMMV, of course, and you might plausibly prefer to let only
people who know and can read and type your script get to your
content.   But we should both, IMO, have the options of doing
whatever meets our needs.
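The "force into URI format and copy that" option might look something like the sketch below. It assumes an absolute IRI with a host, drops any userinfo, and uses Python's "idna" codec for the host and UTF-8 percent-escapes for the rest; the function name iri_to_uri and the example.example-style names are hypothetical.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri):
    """Map an absolute IRI to an ASCII-only URI (userinfo is ignored here)."""
    parts = urlsplit(iri)
    # Host labels go to their punycode ("xn--") form.
    host = parts.hostname.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Everything else is escaped as UTF-8 octets; existing escapes are kept.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))

# "copy IRI" would hand over the string unchanged; this is the other option.
print(iri_to_uri("http://b\u00fccher.example/caf\u00e9"))
```

The result can be typed on any keyboard, which is the point of sending it along on paper.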

I'd add that one approach to the problem would be to undo the
encoding, query DNS to get an IP address, then present that
(possibly with associated SOA information and reverse domain
name lookup); numeric IP addresses aren't going to be mistaken
for some random collection of "characters" (in the Unicode
sense) or non-numeric glyphs.
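A minimal sketch of that approach, assuming the standard socket module and the "idna" codec are good enough for illustration. The function name addresses_for is made up, and SOA or reverse lookups are left out; a real name needs a working resolver, so the usage line below uses a numeric address that resolves locally.

```python
import socket

def addresses_for(domain):
    """Return the sorted, de-duplicated addresses a name resolves to."""
    wire = domain.encode("idna").decode("ascii")   # on-the-wire form
    infos = socket.getaddrinfo(wire, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the textual IP address.
    return sorted({info[4][0] for info in infos})

print(addresses_for("127.0.0.1"))
```

A name with both IPv4 and IPv6 records returns several strings, which is what the user would then have to judge by eye.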


In the discussion above, you made the observation that end users
are not likely to be good at decoding punycode-containing IDNs
on sight.   We agree.   Do you think those users are going to be
better at looking at an IP address and figuring out if it
belongs to whomever they think it belongs to?  If your answer is
"yes", does it change when you think about IPv6: much longer
addresses, multiple addresses per host, etc.?  As you are
thinking about this, note that the world's most popular
operating system doesn't support a "dig" or "nslookup" function
in most of its versions/ variations.

Where it is possible to read the characters and type them back
in, the easiest protection against this type of attack is
extremely well-known from the ASCII-only world, and that is to
type in the URI or IRI one thinks one sees, rather than clicking
on a link.  Now, realistically, no one is going to do that,
especially with complicated URLs, unless they have reason to be
suspicious (of course, the seriously security-paranoid _always_
have reason to be suspicious).  Nor, again, absent suspicion, is
anyone going to attempt reverse mappings or traceroutes on every
domain name.

But, again, let's take this up a level of abstraction and
remember that we are talking about user interfaces -- an area in
which IETF competence has been shown to be, well, limited.
_Nothing_ is going to completely identify and accurately
diagnose every possible case of phishing, fraud, misleading
names, evil programs catching typographical errors, and so on.
That statement is true whether we are talking about IDNs or
about an IDN-free environment. What we should hope is that those
who provide applications and user interfaces will provide their
users and customers with a sufficient range of options to detect
what can reasonably be detected, and create the right level of
suspicion, with an acceptable level of ugliness, to produce
warnings where appropriate, and to give users the appropriate
tools for checking things out that seem dangerous.  We can hope
that the marketplace rewards applications and
applications-writers who do a good job of that.  Those of us who
are a little bloody-minded will probably also hope that natural
selection will appropriately reward those lusers who turn off
all of the checking or select applications that don't have it
because those applications provide a more elegant user
experience.  But nothing the IETF can or will do is going to
help with any of that.

Regarding suggestions that some authority or authorities
should enact some restrictions intended to prevent such
misleading names; in the absence of a globally-recognized
and effective enforcement mechanism, such measures are
meaningless.  And I would hasten to add that a Big
Brother-esque world that such things would lead to would
be highly undesirable (at least by those of us who have
no interest in being "Big Brother").


See "dead horse" above.  The IETF decided to throw whatever
parts of this it could even theoretically control over the wall
and over the wall is probably where it belongs.  However, you
should be aware already that many, perhaps most, domains (at any
level of the tree) have created and enforce rules about the names
they are willing to register and it has pretty much always been
that way.
Like the DNS, many of those decisions are extremely distributed:
if you don't like the rules of one domain, you are free to find
another one whose rules you do like, or to register something
somewhere and then make up your own rules for its subdomains.
Other preferences and restrictions get tied up with trademarks
and enforced by lawyers and neither your religious convictions
nor mine are likely to change that very much.  

IDNs, again, make some things more complicated. A number of
entities have found that, for various reasons, rather aggressive
registration restrictions, sometimes ones that bind groups of
names together, are in the interests of the populations they
serve -- that is what, e.g., RFC 3743 is about.  Others haven't.
You pay your money and you make your choices.

Just as with the YAH00.COM case, no single measure is going to
"fix" or prevent the various problems we can encounter with
IDNs.  But a combination of some thinking, good policies,
adapting tools on the basis of experience, and the level of
user vigilance that seems a requirement for being attached to
the Internet at all these days ought to permit us to use IDNs
at risk comparable to that for LDH-style ASCII names.

I suspect the problem is intractable, and is rooted in the
(IMO ill-conceived) conflation of public DNS "names" (meaning
keywords in the RFC 1958 / RFC 2277 sense) with natural
language / legal "names" (proper names, trademarks, etc.).
[And I agree with Ohta-san's statement that we are observing
the inevitable consequences; not only of internationalization,
but of the underlying conflation of protocol elements with
natural language names.]


You don't need to convince me.  See RFC 3467 and, to a lesser
degree, RFC 3071.  Or you might try to dig out a copy of
draft-klensin-dns-search-06.txt, which I hope to find time to
get back to some day.    But the marketplace and, following
rather than leading it, the IETF, made a different set of
decisions.  Much as I might have wished it otherwise, DNS names
stopped being purely protocol elements the first time it
occurred to someone to put a URL on the side of a bus or in an
advertisement with a popular audience.  That particular genie
isn't going back in the bottle (again, much as some of us might
wish otherwise) and no amount of revising statements about
architecture is going to make any difference.

I would also like to take this opportunity to repeat an earlier
suggestion, viz. that the IAB should update RFC 1958 and give
that update some status more substantive than "Informational".
In particular, such an update should clearly state that
protocol elements are simply that; any resemblance to natural
language names, places, or things is purely coincidental.


Sure.  Who do you think would pay attention to such a statement?

I can only hope that our colleagues at Mozilla will rapidly
supersede their apparent advice to disable IDNs --advice that
seems to me to be equivalent to "you should be happy just
using English"


I don't think that is the equivalent; letters, digits, and
hyphens are not peculiar to English, nor are domain name
components tied to any language -- they are simply protocol
elements that identify places in a hierarchical database
which maps to a database of values associated with a
hierarchical assemblage of such elements.

IMO, advice to disable IDNs is good advice; no
"internationalization" of protocol elements was necessary in
the first place, and the mechanism -- like a number of other
mechanisms in URL syntax (e.g. user/password delimiters in the
"authority" section, %-encodings) which have long been used to
obfuscate or mislead -- leads to predictable consequences.  I
note in passing that other browser suppliers have disabled
similar mechanisms because of concerns about the sort of issue
under discussion.
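The older URL tricks mentioned above are easy to demonstrate with Python's urllib.parse. This is a sketch with invented host names (phish.example is hypothetical); it shows only how the "authority" section and %-encodings parse, not how any particular browser displays them.

```python
from urllib.parse import unquote, urlsplit

def real_host(url):
    """Return the host a client would actually contact."""
    return urlsplit(url).hostname

def has_userinfo(url):
    """True when the authority section carries a user (or user:password) part."""
    return urlsplit(url).username is not None

url = "http://www.example.com@phish.example/login"
print(real_host(url))      # the part after "@" is the host
print(has_userinfo(url))   # the familiar-looking name is only userinfo

# %-encodings can hide an otherwise recognizable string.
print(unquote("%70%68%69%73%68"))
```

A display that flagged has_userinfo, or showed only real_host, would blunt the first trick in much the same way that a punycode display blunts the IDN one.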


Like it or not, there is a large population in the real world
who are not interested in that argument or position.  There are
even folks who are technically sophisticated enough to
understand and accept your argument about protocol identifiers
who nonetheless believe that they should be able to identify
objects with names or acronyms that have mnemonic value in their
languages and character sets.  Personally, I have a lot of
trouble disagreeing with the latter group, and have learned that
disagreeing with the former one doesn't get me anywhere.

best,
   john


_______________________________________________
Ietf mailing list
Ietf(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/ietf