Re: [idn] Re: 7 bits forever!

    Date:        Sat, 06 Apr 2002 11:21:59 -0500
    From:        John C Klensin <klensin(_at_)jck(_dot_)com>
    Message-ID:  <84281257(_dot_)1018092119(_at_)localhost>

  | Hmm.  I'd argue that the other existing protocols count[ed] --
  | FTP and Telnet connection specifications,

Neither telnet nor FTP transmits domain names (other than as uninterpretable
data, as in "Welcome to hostname") - the only relevance of domain name
rules there is to the OS interface (which is perhaps what you mean by
connection specifications) which isn't part of the spec.   It is entirely
possible for any OS to allow just about anything as the name spec, just
as long as it eventually gets mapped to the correct IP address.   Eg: when
I want to fetch an RFC, I type "ftp rfc" and a local mapping mechanism
converts "rfc" into a suitable IP address.

  | two-level specs in
  | Finger and Whois, etc., but I think this isn't important.

While the protocol is ancient, the first finger spec is in 1194 (it
post-dates 1034...).  But yes, finger would have been one use beyond
e-mail - if it had been specified just what was legal and what wasn't.
I kind of suspect that it is more "anything after @ to end of line is passed
to the hostname lookup function" than anything more precise in reality.

Whois is a database lookup protocol, and looks up whatever is stored in the
database, it has nothing to do with the DNS at all (it happens that the
most popular database is that used to generate DNS zone files - but whois
queries are database keys (or searches), not domain names).

  | As I said, trying to take a shortcut and to provide some
  | additional evidence of intent/logical consistency in case the
  | sections that I think are important are read as contradicting
  | those that you cite above.

I don't; the sections quoted are absolutely unambiguous.  Anything
that could contradict them would have to be equally clear, and there is
nothing.

  | I don't read the spec as
  | contradictory, but as specifying three sets of rules: LDH
  | (recommended), ASCII (required for "existing" RRs), binary
  | (possible future extensions in new RRs).

But the LDH isn't a rule, it is very clear that it is a suggestion (as
far as the DNS is concerned).   It is a rule for other protocols that
use domain names (particularly 821/822 and their replacements).

  | More on this below (I'm going to paste in an earlier analysis,
  | with the text cited, rather than screwing it up by trying to
  | reconstruct) but I don't see anything in the text that says "if
  | you see the high bit set, you can assume it is binary and other
  | rules apply; if the high bit is zero, then it is ASCII and needs
  | case-independent comparison".

Except there aren't other rules that apply, so it doesn't say exactly that.
What it says is that if it is an ascii char, case independence applies,
otherwise it doesn't, and (it actually says) ascii chars have the top bit 0
(written as "zero parity" or something like that...)

  | It seems to me that a statement
  | of that general nature would be needed to justify your assertion
  | above.  I note with interest that even 2181 doesn't seem to
  | include such a statement as a clarification of what is an "ascii
  | label" and what is a "binary label".

No, aside from the (much newer) binary labels proposal, which is an
entirely different thing, there is no difference, and cannot be.  There
is exactly one kind of label.   Some of the octets in the label may be
ascii chars, in which case the case independence applies.  Others not.

  | From which I assume that 2181 did not intend to change anything
  | about 1034/1035 in this area and that its approval by the IESG
  | was based on that assumption.

No, it didn't intend to change anything about 1034/5 in this area.
That is, 1034/5 were already perfectly clear that any 8 bit value is
permitted in a DNS label, and all that was ever needed was to reinforce
that.   That's what 2181 does.

[Aside: The environment for this part of 2181 was more that one well
known implementation of the DNS had taken to either prohibiting, or
issuing warnings about (depending on config options, which defaulted to
error for primary, and warning for secondary) names that weren't strictly
LDH.  That is, other perfectly natural ascii chars (like _ and %) were
being rejected.   Actually doing something when the top bit of an octet
in the label is set wasn't the issue - but reiterating that it was
permitted (and hence, DNS implementations must not complain about it)
was important.   Whether anyone sane would use such a thing is a
whole different issue.]
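
A minimal sketch of the kind of check described in that aside, in Python.
The names is_ldh and check_label are hypothetical, and so is the strict
flag standing in for the config option; this is not that implementation's
code, just an illustration of why _ and % fail a strict LDH test.

```python
import string

# Octet values of the letters, digits and hyphen (the "LDH" set).
LDH = frozenset((string.ascii_letters + string.digits + "-").encode("ascii"))


def is_ldh(label: bytes) -> bool:
    """True if the label uses only letters, digits and hyphens,
    and neither starts nor ends with a hyphen."""
    return (
        0 < len(label) <= 63
        and all(octet in LDH for octet in label)
        and not label.startswith(b"-")
        and not label.endswith(b"-")
    )


def check_label(label: bytes, strict: bool) -> str:
    """Hypothetical policy: reject when strict, otherwise only warn."""
    if is_ldh(label):
        return "ok"
    return "error" if strict else "warning"
```

With this sketch, check_label(b"_tcp", strict=True) gives "error" even
though the DNS itself is perfectly happy to carry that label.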

  | I think what I'm suggesting is that the valid content of a given
  | label depends on the RR type (and Class) with which it is
  | associated.  One can question the wisdom of that in retrospect,
  | but that is what the specification says.

No, that isn't what it says or means.   You're not reading it rationally.

Well, that may be for classes, no-one really knows what different classes
mean, anything at all might be possible, so let's skip that part for now.

But if what is permitted as the label of a domain name depends upon what
RR type is stored, how could that rationally work?  Unless you were to
restrict that to only the leaf label, which wouldn't be very useful to
allow the kind of arbitrary data that 1034/5 clearly envisage some day being 
stored in the DNS.  Without that restriction nodes in the DNS that have
NS records, and SOA records, would need to allow your "binary" labels,
as well as others.   And if they do, why not just everything?

No, what 1034/5 are saying is that users need to choose domain names that
meet the requirements of the applications storing the data.  At the time,
essentially everything wanted LDH, as that's all that was permitted in
the hosts file that was being replaced.   However, future applications
manipulating other kinds of names and data were clearly seen as something
for the future, and for them, the domain name to choose might not need
to be restricted.

But in any case, the DNS was already designed to cope (or at least it seemed
that way) - any random 8 bit data can be a label.

  | Of course, to support "case insensitivity for ASCII only", it
  | would be nice to have an algorithmic rule for identifying ASCII.
  | But binary labels can, in principle, have octets with the high
  | bit clear, or even all octets with the high bit clear.

No, you're trying to create two different kinds of labels.  There's no
way to do that, there's just one.   All labels (again leaving aside the
binary labels that were created later, which were entirely different)
use the same syntax - 6 bits of length, and 1 to 63 arbitrary octets.
There aren't ascii labels and binary labels, just labels that sometimes
contain some ascii characters (or often contain all ascii characters).
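
To make the "one syntax" point concrete, here is a rough Python sketch of
how a name is laid out on the wire: a length octet (top two bits zero, so
only 6 bits of length) followed by that many arbitrary octets, with a zero
length octet for the root at the end.  The function names are illustrative
only.

```python
from typing import Iterable


def encode_label(label: bytes) -> bytes:
    """One length octet (1..63, so the top two bits are 0), then the octets."""
    if not 1 <= len(label) <= 63:
        raise ValueError("a non-root label carries 1 to 63 octets")
    return bytes([len(label)]) + label


def encode_name(labels: Iterable[bytes]) -> bytes:
    """Concatenate the encoded labels and end with the zero-length root label."""
    return b"".join(encode_label(label) for label in labels) + b"\x00"
```

Nothing in the encoding looks at the octet values: encode_label(b"\xc9\xff")
is laid out exactly like encode_label(b"AU"), which is the sense in which
there is just one kind of label.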

  | And one
  | does not want to apply case-insensitivity matching to binary
  | labels, no matter how they are structured.

It might not be what one wants, but it is what the DNS does.  If
the value of an octet is between 0x41 and 0x5a then it compares equal
to the corresponding octet with value between 0x61 and 0x7a.
Otherwise all octets are distinct.   And there's just one label type.
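
That comparison rule is small enough to write down.  A sketch in Python,
with fold_octet and labels_equal as made-up names, assuming labels are
handed over as raw octets:

```python
def fold_octet(octet: int) -> int:
    """Map 0x41-0x5A (A-Z) onto 0x61-0x7A (a-z); leave every other octet alone."""
    if 0x41 <= octet <= 0x5A:
        return octet + 0x20
    return octet


def labels_equal(a: bytes, b: bytes) -> bool:
    """Compare two labels octet by octet; there is no per-label 'type' to consult."""
    if len(a) != len(b):
        return False
    return all(fold_octet(x) == fold_octet(y) for x, y in zip(a, b))
```

So labels_equal(b"A\xc9", b"a\xc9") is true (the 0x41 folds, the 0xc9 is
compared as is), while labels_equal(b"\xc9", b"\xe9") is false, whatever
character set one imagines those octets to be in.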

  | So, I believe, in logic, that one needs to know,

If one were designing this now, and attempting to make a rational design,
one may indeed want to know that.   But that isn't what was done, and you
simply cannot attempt to retrofit desires from today onto the specification
of yesterday.   It is what it says it is, no more and no less.

  | But the transition from case insensitive
  | comparison to case sensitive (or binary) comparison would be a
  | very interesting exercise.

Yes...   But perhaps never attempting to create case insensitivity for
anything other than ascii would give insights into what is needed to allow
it to be deleted in ascii as well - someday, maybe.

  | Except that some of those "other ways" may result in octets with
  | the high bit clear that do not represent ASCII characters
  | (assuming, as you do, that 1034/1035 require case insensitive
  | comparison for ASCII only).

That simply isn't possible as the DNS is spec'd.  If the high bit is
clear it *is* to be interpreted as ascii.   There is no option, however
much you might want to have 0x41 and 0x61 as distinct "binary" labels,
you simply cannot.  And it doesn't matter what the RR type is, you cannot.
(Perhaps in some other class, but even that is "probably not possible").

  | In other words, the DNS needs to
  | know something about the encoding in order to know when to apply
  | case insensitive comparison (and, potentially, how to do it).

No, you're inventing what you think it should need - it doesn't need
anything of the kind, it simply does what the spec says it does.

  | I think this is correct.   While I haven't done the analysis, my
  | intuition tells me that, if we are going to go down this path on
  | the server side, we may have big problems with potentially-recursive
  | RRs like DNAME and NAPTR, but that is a separate problem.  I
  | hope.

Perhaps - though if anyone took seriously the DNS rule about only ever
storing names in their one true case, it wouldn't be an issue at all.

That is, in the root zone, it is COM that exists, not com - so everywhere
else in the DNS, when the (TLD) COM is meant, it should be entered as COM,
never as com (nor Com nor any other variant).   If someone does a lookup
they can give any case, and the matching rules find it - but everyone is
supposed to enter data using the one defined case for the name, as created
by the owner of the name (so it is Berkeley.EDU or should be, not
BERKELEY.EDU and not berkeley.edu ...)

Of course, that rule was ignored from day 1 (the EDU zone file persists
in storing BERKELEY.EDU I believe).   (However, you may note that you
essentially always see my e-mail address as "@munnari.OZ.AU" because that's
its representation in the DNS, except in places I don't control).
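
A toy illustration of "any case matches on lookup, but the stored case is
the owner's": the Zone class below is hypothetical, names are kept as whole
byte strings rather than split into labels to keep it short, and the
address is a documentation placeholder, not real data.

```python
def fold(name: bytes) -> bytes:
    """Lower-case only the octets 0x41-0x5A; everything else is untouched."""
    return bytes(o + 0x20 if 0x41 <= o <= 0x5A else o for o in name)


class Zone:
    """Case-preserving store with case-insensitive matching (illustrative only)."""

    def __init__(self):
        self._records = {}

    def add(self, name: bytes, data: str) -> None:
        # Index by the folded form, but remember the name exactly as entered.
        self._records[fold(name)] = (name, data)

    def lookup(self, name: bytes):
        # Any case variant of the query finds the entry; the answer carries
        # the case chosen by whoever created the name.
        return self._records.get(fold(name))


zone = Zone()
zone.add(b"munnari.OZ.AU", "192.0.2.1")  # placeholder address
print(zone.lookup(b"MUNNARI.oz.au"))
```

The lookup for MUNNARI.oz.au hands back the name as munnari.OZ.AU, which
is the behaviour described above for COM and Berkeley.EDU.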

  | There we probably disagree -- I suggest that the text is at
  | least ambiguous and might require it.  But, at some level, it
  | isn't important, because the text clearly prohibits non-ASCII
  | labels in "existing" RRs.  See below.

No, it doesn't prohibit them at all, it says that they probably should
not be used.

  | From RFC1034, section 3.1

  | That statement is presumably part of your justification for
  | assuming that all bets are off if the high order bit is on.

Yes.  The "someday we may need to add" is the only text in either 1034 or 5
which really supports your suggested text, and I claim it means "that
someday people might need to actually use", and that "full binary labels"
just means "labels using the full range of characters" not some new
kind of label.

  | I'm inclined to read "additions beyond current usage" as
  | implying new RRs or new Classes, you are inclined to read it as
  | having octets with the high bit on appear in existing RRs.

I read it as whenever an application wants to be able to use more than
just ascii labels, whatever the RR type.   Eg: if someone had decided
that X.400 should replace 821/822, but run over the internet, using domain
names, then A records (and presumably MX records) would still be needed,
but there would be no LDH rule to comply with.  Names used for such an
application would be whatever it specified as legal.   The DNS is unchanged.

  | It seems to me that this is at least a bit ambiguous, rather than
  | crystal-clear in the latter direction.

Even if so, the other words that permit any 8 bit value in a label
are crystal clear - nothing that is "a bit ambiguous" can possibly
be treated as overriding what is very clearly stated.

  | More important, it
  | appears to me to make a clear (and necessary) distinction
  | between "character strings" and "full binary octet capabilities"
  | in the DNS, to require case-insensitive comparison only for the
  | former, and hence to require that one be able to tell the
  | difference unambiguously.

If there was such a distinction, then yes, but there isn't.  That's
not the way the spec was written, and not the way it is implemented.
There are just labels, the case insensitive rule is applied individually
to each octet, not to whole labels.

  | But the first part of this does say "For all parts of the DNS
  | that are part of the official protocol, all comparisons between
  | character strings ...  are done in a case-insensitive manner."
  | To emphasize, that is "all parts" and "all comparisons", not
  | "unless you happen to find the high bit turned on".

Sure, no problem with that.   But what's there when the high bit is on
isn't ascii, and so has no "case" to be sensitive to.

  | So, in the
  | absence of some standards-track document that changes the
  | comparison rule -- either for new RRs or retroactively for
  | existing ones -- it seems to me that we are stuck with it.

Yes, but we're stuck with what it actually says, not what you'd prefer
that it had said.

  | And, if after going through this, you find that we are still
  | reading the text differently, I suggest that 2181 probably need
  | updating to clarify how one of those "any binary string" labels
  | are to be interpreted when they appear in queries that require
  | case-insensitive matching.

We could update 2181 again, but this part was really thought almost
unnecessary even when 2181 was written.   If that well known implementation
hadn't started doing absurd things, this part of 2181 would never have been
there, as this part of 1034/5 was never really considered to be unclear
in the slightest.   And note: the implementation didn't change because of
any lack of clarity about the DNS spec, but as a means of protecting hosts
from lots of broken implementations that did "bad" things if presented
with suitably crafted DNS responses (particularly to PTR lookups).
That is, buggy implementations made bad assumptions, and that was "fixed"
by having the DNS filter the "bad" data for the application.

There was never a serious argument that the DNS actually specified this.

kre

ps: I have no real issues with what Eric Hall wrote in his replies, but
I need to read them again and have no time right now.