Re: [idn] Re: 7 bits forever!

    Date:        Fri, 05 Apr 2002 14:41:53 -0500
    From:        John C Klensin <klensin(_at_)jck(_dot_)com>
    Message-ID:  <9863660(_dot_)1018017713(_at_)localhost>

I really hoped to be able to avoid having to do this, yet again...

  | As I read them, what 1034 and 1035 say is that the DNS can
  | accommodate any octets, but that [at least then]
  | currently-specified RRs are restricted to ASCII.

Sorry John, I can't fathom how you could possibly reach that conclusion
from what is in 1034 & 1035.

Eg: from 1035 (section 3.1) ...

        Although labels can contain any 8 bit values in octets that make up a
        label, it is strongly recommended that labels follow the preferred
        syntax described elsewhere in this memo, which is compatible with
        existing host naming conventions.

How much clearer do you want it?
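
To make the point concrete, here is a rough sketch of the label layout
that 1035 section 3.1 describes: each label is a length octet followed by
that many octets, and a zero-length label ends the name. The function name
encode_name is made up for this illustration and is not taken from any
real implementation. Note that nothing in the layout cares whether the top
bit of a data octet is set.

```python
def encode_name(labels: list[bytes]) -> bytes:
    """Encode labels as length-prefixed octet strings ending in a zero octet.

    Hypothetical helper for illustration only. It checks the size limits
    (63 octets per label, 255 per name) and nothing about the octet values.
    """
    out = bytearray()
    for label in labels:
        if not 1 <= len(label) <= 63:
            raise ValueError("label must be 1..63 octets")
        out.append(len(label))  # length octet, top two bits are zero
        out += label            # the label octets, whatever they are
    out.append(0)               # root label terminates the name
    if len(out) > 255:
        raise ValueError("name exceeds 255 octets")
    return bytes(out)


# A label with the top bit set encodes just like an ascii one.
wire = encode_name([b"\xc3\xa1", b"example"])
```

What such a label means, and how it compares to others, is the part that
is left undefined.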

1034 is less clear, but (section 3.5 --- note its title)

  3.5. Preferred name syntax

  The DNS specifications attempt to be as general as possible in the rules
  for constructing domain names.  The idea is that the name of any
  existing object can be expressed as a domain name with minimal changes.
  However, when assigning a domain name for an object, the prudent user
  will select a name which satisfies both the rules of the domain system
  and any existing rules for the object, whether these rules are published
  or implied by existing programs.

"The prudent user will select" ...  that is, this is a damn good idea,
but you don't have to do it if you know what you're doing.

  | The LDH rule is a good ("best"?) practices one.

It is required (as updated) if the domain name is to be used in an e-mail
header (which, back then, was almost the only other formalised place that
domain names appeared; other than that, it was all OS-specific command/arg stuff).

  | But the ASCII rule is a firm requirement.

No it isn't, there is nothing at all which says that.

  | For evidence of this, temporarily ignore the text

Hmm - ignore what is written, and attempt to infer from something else...

  | (although, personally, I think it is clear -- especially in 2.3.3--
  | if read carefully)

2.3.3 is about character case, and I agree, that is a very messy area indeed.

  | and examine
  | the requirement that, for the defined RRs, labels and queries be
  | compared in a case-insensitive way.

Not quite.   What it says is that ascii labels (ones with the top bit clear)
must be handled that way; it carefully refrains from saying what should
be done in other cases - leaving that for future definition (which is kind
of what this recent work has all been about).   However, it clearly allows
non-ascii labels - it just doesn't specify what they mean, or how to
interpret them.   That's what needed to be done to allow non-ascii names
to have some kind of meaning.

  | So I believe that the "future RRs" language with regard to
  | binary labels in 1034 and 1035 must be taken seriously and as
  | normative text: if new RRs (or new classes) are defined, they
  | can be defined as binary and,

Have you actually thought about what you have just said?   That is,
the rules for naming the DNS tree depend upon the data that is stored
there?

Do you seriously mean that?

Classes are a whole other mess, that no-one really seems to understand,
one of those "this might be a good idea" frills, that is completely undefined.
It isn't clear whether different classes share the same namespace or
not (just that they share a few RR type definitions).   Classes are
essentially extinct.

  | hence, as not requiring
  | case-insensitive comparisons.  Conversely, within the current
  | set (or at least the historical set at the time of 1034/1035),
  | case-insensitive comparison is required and hence binary must
  | not be permitted.

Case insensitive comparison of ascii is required, what is done with the
rest is undefined.   To make it meaningful it needs to be defined, that
I agree with.

One easy (though perhaps not desirable, I don't know) solution would be
to simply restrict the case insensitive part, as far as the DNS is
concerned, to ascii only, so that A==a but Á!=á.   Eventually doing away
with case insensitivity for all labels seems like a good idea to me.
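
As a minimal sketch of that ascii-only rule (the names ascii_fold and
labels_equal are invented here, and this is not lifted from any actual
server), the comparison folds only the octets for 'A' through 'Z' and
compares everything else byte for byte:

```python
def ascii_fold(label: bytes) -> bytes:
    """Lower-case only the octets 'A'..'Z'; leave every other octet alone."""
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in label)


def labels_equal(a: bytes, b: bytes) -> bool:
    """Compare two labels, ignoring case for ascii letters only."""
    return ascii_fold(a) == ascii_fold(b)


# ascii letters match regardless of case.
assert labels_equal(b"A", b"a")

# The accented pair is not folded, so the labels differ
# (shown here with UTF-8 octets, but any encoding with the
# top bit set behaves the same way).
assert not labels_equal("\u00c1".encode("utf-8"), "\u00e1".encode("utf-8"))
```

The same function gives the same answers for Latin-1 or any other 8 bit
encoding of the accented characters, since none of their octets fall in
the 'A' to 'Z' range.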

  | Any other reading, I believe, leads immediately either to
  | contradictions or to undefined states within the protocol.

Undefined, yes.   That's not unusual, lots of protocols have undefined states.

  | As an aside, it appears to me that this requirement for
  | case-insensitive comparison is the real problem with "just put
  | UTF-8 in the DNS" approaches.

Not really - what causes the problem is putting more than ascii there.
As soon as you permit that, you have to deal with all of the issues.
The way the bytes are encoded is irrelevant.   One way out of this is
to require that the DNS always use the "lower" case (whatever that happens
to be in any particular instance - that is, whenever multiple characters are
generally assumed to mean the same, pick one as the one that must always be
used within the DNS) and have the resolver enforce it.   Whether the
data once chosen is encoded in UTF-8 or some other way is irrelevant.

The problem with doing this is that it requires every resolver to be able
to handle every possible case mapping (for any domain that it may ever
encounter - which is all of them, of course).   On the other hand, doing it
in the server only requires the server to understand the case folding rules
for the actual domain names it serves, not necessarily anyone else's
(back end caches have a problem either way of course).
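
The resolver-side variant might be sketched like this. It is only an
illustration: Python's str.casefold() and NFC normalisation stand in for
"pick one form" and are my assumption, not a proposal for the actual
mapping, and canonical_label is a made-up name.

```python
import unicodedata


def canonical_label(label: str, encoding: str = "utf-8") -> bytes:
    """Map a label to a single agreed form before it goes on the wire.

    Assumption: Unicode case folding plus NFC is the chosen mapping.
    The encoding argument only affects the octets, not the mapping.
    """
    folded = unicodedata.normalize("NFC", label.casefold())
    return folded.encode(encoding)


# Both spellings collapse to the same octets...
assert canonical_label("\u00c1RBOL") == canonical_label("\u00e1rbol")

# ...and that stays true whichever encoding is used.
assert (canonical_label("\u00c1RBOL", "utf-16-be")
        == canonical_label("\u00e1rbol", "utf-16-be"))
```

The catch described above is visible here: the resolver has to carry the
folding tables for every script it might ever be asked about.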

In any case, these are the issues that a WG that was tasked with defining
how the DNS should treat non ascii labels should be dealing with.
Currently, there's none of that happening - idn simply decided not to
bother, and make everything inside the DNS remain ascii forever.
(Recently I have seen some ramblings about long term conversion from ACE to
UTF-8 inside the DNS - that's a ludicrous prospect that can never happen).

  | An existing and conforming
  | implementation has no way to do those required case-insensitive
  | comparisons outside the ASCII range.

No, nor is it required to.

  | One supposes that we could modify the protocol to specify that
  | case-insensitive comparisons be made only for octets in the
  | ASCII range, but, unless that were done through an EDNS option,
  | it would be a potentially fairly significant retroactive change.

That's not actually a modification, that's what is currently required.

kre