
RE: Will Language Wars Balkanize the Web?

2000-12-06 10:50:02
I can't agree more.

-----Original Message-----
From: John C Klensin [mailto:klensin(_at_)jck(_dot_)com]
Sent: 06 December 2000 16:46
To: vint cerf
Cc: ietf(_at_)ietf(_dot_)org; idn(_at_)ops(_dot_)ietf(_dot_)org
Subject: Re: Will Language Wars Balkanize the Web?


(Can we please move this discussion to the IDN list, where it
belongs?)

--On Wednesday, 06 December, 2000 08:19 -0500 vint cerf
<vcerf(_at_)MCI(_dot_)NET> wrote:

Mr. Ohta has put his finger on a key point: ability of all
parties to generate email addresses, web page URLs and so on.
Even if we introduce extended character sets, it seems vital
that there be some form of domain name that can be rendered
(and entered) as simple IA5 characters to assure continued
interworking at the most basic levels. This suggests that
there is a need for some correspondence between an IA5 Domain
Name and any extended character set counterpart.

Vint,

I think I agree with the principle.  However, there are several
different models with which the "correspondence" can be
implemented.  The difference among them is quite important
technically --implementations would need to occur in different
places and with different implications, deployment times, and
side effects--  and perhaps as important philosophically.  E.g.,
let me try to describe some of them in extreme form to help
identify the differences:

(i) The names in the DNS are "protocol elements".  They should
be expressed in a minimal subset of ASCII so that they can be
rendered and typed on almost all of the world's equipment (the
assumption that, e.g., all Chinese or Arabic keyboards and
display devices in the medium to long term will contain Roman
characters seems a little dubious).  There is no requirement
that they be mnemonic in any language: in principle, a string
containing characters selected at random would do as well as the
name of a company, person, or product.

This model gives rise to directory and keyword systems (most of
them outside the DNS) that contain the names that people use.
While the registration and name-conflict problems are
non-trivial, names in multiple languages and character codings
can easily map onto a single DNS identifier.  On the other hand,
binding a national-language name to an ASCII name would need to
be done either by parallel registrations or by matching on
keywords (and the latter might not yield unambiguous and
accurate results).
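To make the directory model concrete, here is a minimal sketch in
Python (the domain name and keywords are entirely hypothetical):
several human-language names resolve, in a directory outside the DNS,
to a single opaque ASCII DNS identifier.

```python
# Hypothetical sketch of the directory layer in model (i): the DNS
# identifier is an opaque ASCII protocol element; a directory outside
# the DNS maps names in several languages and scripts onto it.
directory = {
    "acme widgets": "a7k2-widgets.example",  # English keyword
    "ウィジェット": "a7k2-widgets.example",  # Japanese rendering
    "виджеты": "a7k2-widgets.example",       # Russian rendering
}

def lookup(keyword):
    """Resolve a human-language keyword to its registered DNS name,
    or return None if no registration exists."""
    return directory.get(keyword.strip().lower())

print(lookup("Acme Widgets"))  # a7k2-widgets.example
print(lookup("виджеты"))       # a7k2-widgets.example
```

Note that the bindings here exist only because someone registered each
keyword against the DNS name -- exactly the parallel-registration
burden described above.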

(ii) Entries in the DNS are always coded.  After all, "ASCII" is
just a code mapping between a human-visible character set and a
machine (or wire) representation.  It is the job of an
application to get from "characters" to "codes" and back, and to
recognize coding systems and apply the correct decodings.
Software that is old or broken will simply display a different
rendering of the coded form (whether that is a "hexification",
Base64, or some other scheme).

This model gives rise to the "ACE all the way up" models, in
which non-ASCII names are placed in the DNS using some tagging
system, but the "ASCII representation" of a name that, in the
original, uses non-Roman characters, may be quite ugly and bear
no connection with the name as it would be rendered using the
original characters other than an algorithmic one.   It also
gives rise to some of the UTF-8 models, on the assumption that
applications that can't handle the full IS 10646 character set
can do something intelligent.
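As a concrete illustration of the "ACE all the way up" idea, here is a
sketch using Punycode -- the ASCII-Compatible Encoding that was later
standardized for IDNA, and which Python exposes as a built-in "idna"
codec. The point to notice is the one made above: the ASCII form is
derived purely algorithmically and bears no mnemonic connection to the
original label.

```python
# A non-ASCII label and its ASCII-Compatible Encoding (ACE).
# Python's built-in "idna" codec implements IDNA 2003, whose ACE
# is Punycode with the "xn--" prefix.
label = "münchen"
ace = label.encode("idna")
print(ace)                 # b'xn--mnchen-3ya' -- ugly, but pure ASCII
print(ace.decode("idna"))  # round-trips back to 'münchen'
```

Old or broken software would simply display the coded form
(xn--mnchen-3ya) rather than the original characters, which is
precisely the behavior model (ii) accepts.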

(iii) Regardless of how the names in the DNS are coded, it is
important to have analogies to "two-sided business cards".  Such
systems assume that any name rendered in a non-Roman character
set should have an analogue in Roman characters.  And those
analogues are expected to be bound to the original form by
transliteration or translation -- they aren't just random,
algorithmically matching, strings.

While there need to be facilities for the non-Roman (even
non-ASCII) characters in either the DNS or a directory,
establishing the "ASCII names" is, of necessity, a registration
issue rather than an algorithmic issue.  We don't know how to do
the "translation" (or, in the general case, even
transliteration) algorithmically.   To give one example, despite
the "Han unification" of IS 10646, the characters on a Japanese
business card for you would almost certainly be different from
those on a Chinese business card for you.    And, because of the
registration issue, there is no plausible way to impose a
requirement that every host (or other DNS entry) have a name in
ASCII if it has a name in some other script: people and hosts
not visible outside their own countries may not care enough to
go to the trouble.

These models are not mutually exclusive.  But they are
definitely different perspectives.

It is also worth noting that, as a matter of perspective, the
dominance of subsets of ASCII in these debates has some
important technical advantages (e.g., the code set can be made
very small and the canonicalization/matching rules are
algorithmic, universally-agreed, and trivial), but it is also
significantly an historical accident.  Because of that
historical accident, we tend to couch these discussions (as your
note does and as I have done above) in terms of ASCII <->
something-else mappings.  But it isn't hard to imagine a
business card containing Thai and Chinese, or Vietnamese and
Sanskrit, or Hebrew and Arabic.  It would be interesting, but
impractical in the extreme, to try to insist that all DNS names
be renderable in all languages and the associated character
repertoires (and more impractical if we insisted that the
renderings have "meaning").  But I think we need to remember
that limiting case as we try to figure out what should be done
here.

      john