
Re: [idn] Re: 7 bits forever!

2002-04-06 09:50:24
--On Saturday, 06 April, 2002 18:44 +0700 Robert Elz
<kre(_at_)munnari(_dot_)OZ(_dot_)AU> wrote:

    Date:        Fri, 05 Apr 2002 14:41:53 -0500
    From:        John C Klensin <klensin(_at_)jck(_dot_)com>
    Message-ID:  <9863660(_dot_)1018017713(_at_)localhost>

I really hoped to be able to avoid having to do this, yet
again...

And I apologize for taking a shortcut because I didn't want to
take the time to pull out the text again.  

  | As I read them, what 1034 and 1035 say is that the DNS can
  | accommodate any octets, but that [at least then]
  | currently-specified RRs are restricted to ASCII.

Sorry John, I can't fathom how you could possibly reach that
conclusion from what is in 1034 & 1035.

See below.

Eg: from 1035 (section 3.1) ...

      Although labels can contain any 8 bit values in octets that
      make up a label, it is strongly recommended that labels
      follow the preferred syntax described elsewhere in this memo,
      which is compatible with existing host naming conventions.

How much clearer do you want it?

Yes, that is one of the "LDH rules are a good practice" sections
I was referring to.  And

1034 is less clear, but (section 3.5 --- note its title)

  3.5. Preferred name syntax

  The DNS specifications attempt to be as general as possible in
  the rules for constructing domain names.  The idea is that the
  name of any existing object can be expressed as a domain name
  with minimal changes.  However, when assigning a domain name for
  an object, the prudent user will select a name which satisfies
  both the rules of the domain system and any existing rules for
  the object, whether these rules are published or implied by
  existing programs.

"The prudent user will select" ...  that is, this is a damn
good idea, but you don't have to do it if you know what you're
doing.

And that is the other one.  We are in complete agreement about
this.

  | The LDH rule is a good ("best"?) practices one.

It is required (as updated) if the domain name is to be used
in an e-mail header (which back then, was almost the only
other formalised place that domain names appeared - other than
that was all OS specific command/arg stuff).

Hmm.  I'd argue that the other existing protocols count[ed] --
FTP and Telnet connection specifications, two-level specs in
Finger and Whois, etc., but I think this isn't important.

  | But the ASCII rule is a firm requirement.

No it isn't, there is nothing at all which says that.

See below.
 
  | For evidence of this, temporarily ignore the text

Hmm - ignore what is written, and attempt to infer from
something else...

As I said, trying to take a shortcut and to provide some
additional evidence of intent/logical consistency in case the
sections that I think are important are read as contradicting
those that you cite above.  I don't read the spec as
contradictory, but as specifying three sets of rules: LDH
(recommended), ASCII (required for "existing" RRs), binary
(possible future extensions in new RRs).
 
  | (although, personally, I think it is clear -- especially in
  | 2.3.3 -- if read carefully)

2.3.3 is about character case, and I agree, that is a very
messy area indeed.

  | and examine
  | the requirement that, for the defined RRs, labels and queries
  | be compared in a case-insensitive way.

Not quite.   What it says is that ascii labels (ones with the top
bit clear) must be handled that way, it carefully refrains
from saying what should be done in other cases - leaving that
for future definition (which is kind of what this recent work
has all been about).   However, it clearly allows non-ascii
labels - it just doesn't specify what they mean, or how to
interpret them.   That's what needed to be done to allow
non-ascii names to have some kind of meaning.

More on this below (I'm going to paste in an earlier analysis,
with the text cited, rather than screwing it up by trying to
reconstruct) but I don't see anything in the text that says "if
you see the high bit set, you can assume it is binary and other
rules apply; if the high bit is zero, then it is ASCII and needs
case-independent comparison".  It seems to me that a statement
of that general nature would be needed to justify your assertion
above.  I note with interest that even 2181 doesn't seem to
include such a statement as a clarification of what is an "ascii
label" and what is a "binary label".  What it says instead
(section 11) is

                Those restrictions aside, any binary string whatever can
                be used as the label of any resource record.  Similarly,
                any binary string can serve as the value of any record
                that includes a domain name as some or all of its value
                (SOA, NS, MX, PTR, CNAME, and any others that may be
                added).

and, from the abstract, where "the other two" refer to the
canonical name issue and the valid contents of a label:

                The other two are already adequately specified, however
                the specifications seem to be sometimes ignored.  We
                seek to reinforce the existing specifications.

From which I assume that 2181 did not intend to change anything
about 1034/1035 in this area and that its approval by the IESG
was based on that assumption.

  | So I believe that the "future RRs" language with regard to
  | binary labels in 1034 and 1035 must be taken seriously and as
  | normative text: if new RRs (or new classes) are defined, they
  | can be defined as binary and,

Have you actually thought about what you have just said?
That is, the rules for naming the DNS tree depend upon the
data that is stored there?

Do you seriously mean that?

I think what I'm suggesting is that the valid content of a given
label depends on the RR type (and Class) with which it is
associated.  One can question the wisdom of that in retrospect,
but that is what the specification says.

Classes are a whole other mess, that no-one really seems to
understand, one of those "this might be a good idea" frills,
that is completely undefined. It isn't clear whether different
classes share the same namespace or not (just that they share
a few RR type definitions).   Classes are essentially extinct.

We could debate that too, but I agree that it does not seem
important at this stage, except, perhaps, to understanding where
binary labels might be used.

  | hence, as not requiring
  | case-insensitive comparisons.  Conversely, within the current
  | set (or at least the historical set at the time of 1034/1035),
  | case-insensitive comparison is required and hence binary must
  | not be permitted.

Case insensitive comparison of ascii is required, what is done
with the rest is undefined.   To make it meaningful it needs
to be defined, that I agree with.

One easy (though perhaps not desirable, I don't know) solution
would be to simply restrict the case insensitive part, as far
as the DNS is concerned, to ascii only, so that A==a but Á!=á.
Eventually doing away with case insensitivity for all labels
seems like a good idea to me.
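A minimal sketch of that rule (hypothetical code, not text from any RFC): case folding applied only to the ASCII letters, with every other octet compared as-is.

```python
def fold(octet: int) -> int:
    # Fold ASCII uppercase A-Z (0x41-0x5A) to lowercase; leave every
    # other octet, including those with the high bit set, untouched.
    return octet + 0x20 if 0x41 <= octet <= 0x5A else octet

def labels_equal(a: bytes, b: bytes) -> bool:
    # Octet-by-octet comparison under ASCII-only case folding.
    return len(a) == len(b) and all(fold(x) == fold(y) for x, y in zip(a, b))

# "A" matches "a", but Latin-1 0xC1 (Á) does not match 0xE1 (á).
```

Under this rule labels_equal(b"A", b"a") holds, while labels_equal(b"\xC1", b"\xE1") does not.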

Of course, to support "case insensitivity for ASCII only", it
would be nice to have an algorithmic rule for identifying ASCII.
But binary labels can, in principle, have octets with the high
bit clear, or even all octets with the high bit clear.  And one
does not want to apply case-insensitivity matching to binary
labels, no matter how they are structured.  So, I believe, in
logic, that one needs to know, on a per-RR type (or, in
principle, per-query-type or other per-query) basis, whether the
comparison involves character comparison (hence case insensitive
over at least some of the octets) or binary comparison
(comparison of bits, no fussing).

I can't comment on whether doing away with case insensitivity is
a good idea, since I can argue either for or against it in new
applications.  But the transition from case insensitive
comparison to case sensitive (or binary) comparison would be a
very interesting exercise.

  | Any other reading, I believe, leads immediately either to
  | contradictions or to undefined states within the protocol.

Undefined, yes.   That's not unusual, lots of protocols have
undefined states.

See below.

  | As an aside, it appears to me that this requirement for
  | case-insensitive comparison is the real problem with "just
  | put UTF-8 in the DNS" approaches.

Not really - what causes the problem is putting more than
ascii there. As soon as you permit that, you have to deal with
all of the issues. The way the bytes are encoded is
irrelevant.   One way out of this is to require that the DNS
always use the "lower" case (whatever that happens to be in
any particular instance - that is, whenever multiple
characters are generally assumed to mean the same, pick one as
the one that must always be used within the DNS) and have the
resolver enforce it.   Whether the data once chosen is encoded
in UTF-8 or some other way is irrelevant.

Except that some of those "other ways" may result in octets with
the high bit clear that do not represent ASCII characters
(assuming, as you do, that 1034/1035 require case insensitive
comparison for ASCII only).  In other words, the DNS needs to
know something about the encoding in order to know when to apply
case insensitive comparison (and, potentially, how to do it).
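As an illustration (UTF-7 is chosen here purely as a hypothetical example of an "other way"; the thread does not propose it): a non-ASCII character can travel entirely in octets with the high bit clear, so a comparator keyed on the high bit would misclassify the label as ASCII text.

```python
# Encode a non-ASCII character in UTF-7, which uses only
# ASCII-range octets on the wire.
encoded = "á".encode("utf-7")

# Every octet has the high bit clear, yet the octets do not
# represent ASCII characters; an ASCII case-folding comparator
# keyed on the high bit would be applied to them by mistake.
assert all(octet < 0x80 for octet in encoded)
```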

The problem with doing this is that it requires every resolver
to be able to handle every possible case mapping (for any
domain that it may ever encounter - which is all of them, of
course).   On the other hand, doing it in the server only
requires the server to understand the case folding rules for
the actual domain names it serves, not necessarily anyone
else's (back end caches have a problem either way, of course).

I think this is correct.   While I haven't done the analysis, my
intuition tells me that, if we are going to go down this path on
the server side, we may have big problems with potentially-recursive
RRs like DNAME and NAPTR, but that is a separate problem.  I
hope.
 
In any case, these are the issues that a WG that was tasked
with defining how the DNS should treat non ascii labels should
be dealing with. Currently, there's none of that happening -
idn simply decided not to bother, and to make everything inside
the DNS remain ascii forever. (Recently I have seen some
ramblings about long term conversion from ACE to UTF-8 inside
the DNS - that's a ludicrous prospect that can never happen).

Yes.
 
  | An existing and conforming
  | implementation has no way to do those required
  | case-insensitive comparisons outside the ASCII range.

No, nor is it required to.

There we probably disagree -- I suggest that the text is at
least ambiguous and might require it.  But, at some level, it
isn't important, because the text clearly prohibits non-ASCII
labels in "existing" RRs.  See below.

  | One supposes that we could modify the protocol to specify
  | that case-insensitive comparisons be made only for octets in
  | the ASCII range, but, unless that were done through an EDNS
  | option, it would be a potentially fairly significant
  | retroactive change.

That's not actually a modification, that's what is currently
required.

Not my reading of sections you didn't cite.  See below.


That earlier analysis (slightly updated) and the text
citations...

[...]

... and that has led me to carefully re-read old text.
That, in turn, leads to a question: it is very clear that
nothing in the DNS spec requires the LDH rule, even though it
appears as "prudent user" guidance in section 2.3.1 of RFC 1035
(and elsewhere). But it appears to me that binary labels are not
permitted on the common RR types, for at least one
technically-rational reason, and that 2181 glosses this over a
bit.

Specifically...

From RFC1034, section 3.1

                By convention, domain names can be stored with arbitrary
                case, but domain name comparisons for all present domain
                functions are done in a case-insensitive manner,
                assuming an ASCII character set, and a high order zero
                bit.  This means that you are free to create a node with
                label "A" or a node with label "a", but not both as
                brothers; you could refer to either using "a" or "A".
                When you receive a domain name or label, you should
                preserve its case.  The rationale for this choice is
                that we may someday need to add full binary domain names
                for new services; existing services would not be
                changed.

That statement is presumably part of your justification for
assuming that all bets are off if the high order bit is on.
Whether that is important depends on what "existing services"
refers to, plus the problem of binary labels that don't happen
to contain octets with the high bit set and how they are to be
recognized and thence compared.
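That recognition problem can be made concrete with a small, entirely hypothetical sketch: a binary label whose octets all fall below 0x80 is indistinguishable, by any per-octet test, from ASCII text, and case-folding it conflates distinct values.

```python
def ascii_fold(octet: int) -> int:
    # The "assume ASCII text when the high bit is clear" heuristic.
    return octet + 0x20 if 0x41 <= octet <= 0x5A else octet

# Two distinct binary labels; every octet has the high bit clear,
# and 0x41/0x61 happen to coincide with ASCII 'A' and 'a'.
label1 = bytes([0x02, 0x41, 0x7F])
label2 = bytes([0x02, 0x61, 0x7F])

# Case-insensitive comparison treats the two different binary
# values as the same label.
folded_equal = [ascii_fold(x) for x in label1] == [ascii_fold(y) for y in label2]
```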


and RFC1035:

                2.3.3. Character Case
                
                For all parts of the DNS that are part of the official
                protocol, all comparisons between character strings
                (e.g., labels, domain names, etc.) are done in a
                case-insensitive manner.  At present, this rule is in
                force throughout the domain system without exception.
                However, future additions beyond current usage may need
                to use the full binary octet capabilities in names, so
                attempts to store domain names in 7-bit ASCII or use of
                special bytes to terminate labels, etc., should be
                avoided.

I'm inclined to read "additions beyond current usage" as
implying new RRs or new Classes, you are inclined to read it as
having octets with the high bit on appear in existing RRs.  It
seems to me that this is at least a bit ambiguous, rather than
crystal-clear in the latter direction.  More important, it
appears to me to make a clear (and necessary) distinction
between "character strings" and "full binary octet capabilities"
in the DNS, to require case-insensitive comparison only for the
former, and hence to require that one be able to tell the
difference unambiguously.

But the first part of this does say "For all parts of the DNS
that are part of the official protocol, all comparisons between
character strings ...  are done in a case-insensitive manner."
To emphasize, that is "all parts" and "all comparisons", not
"unless you happen to find the high bit turned on".  So, in the
absence of some standards-track document that changes the
comparison rule -- either for new RRs or retroactively for
existing ones -- it seems to me that we are stuck with it.  And
that "However" sentence seems to apply to storage forms in
implementations, not to what is permitted in labels or queries.

[...]
The requirement to do case-mapping is, I think, ultimately a
restriction on the labels.  It makes it hard for me to think
about the interpretation of a binary label unless the label is
specified as "binary" as part of the description of the
associated RR.  Indeed, given the understanding we have gained with
the IDN WG (which PVM probably didn't have when 1034/1035 and
their predecessors were written), it makes it hard for me to
think about anything but ASCII for anything but new RRs (or,
potentially, classes).  Moreover, the text of 1034/1035 appears
to me to require ASCII labels for all RR types specified in
those documents, and maybe even for all new RR types that don't
explicitly specify binary labels.

And, if after going through this, you find that we are still
reading the text differently, I suggest that 2181 probably needs
updating to clarify how one of those "any binary string" labels
is to be interpreted when it appears in queries that require
case-insensitive matching.  Otherwise, we have what appears to
be a very strong statement about what is permitted with no
specification at all about how it is handled if one appears.
That doesn't seem to me to be the path to interoperability.

     john


