Re: Impending publication: draft-iab-idn-nextsteps-05

Dear Mark,

Thank you for this list of remarks. Being engaged in the MDRS project, a multilingual distributed registry system for our own information service to user communities and for open use, I share many of these points. Let me add our comments and suggestions.


At 07:52 23/04/2006, Mark Davis wrote:

The UTC strongly supports many of the goals of the document, including especially improving the security of IDNs, and updating the version of Unicode used in NamePrep and StringPrep (since the old version of Unicode they require excludes or hampers many languages).

With my user QA hat on: there is a need for a list, or a place to record, the different available libraries and the Unicode versions they support. The same goes for the langtags libraries. This should be normal "after sales" support for IDNA and RFC 3066 Bis.
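
As a small illustration of the kind of record meant here, a library can expose the Unicode version it was built against. The sketch below uses Python's standard `unicodedata` module purely as an example; the helper name `report_unicode_versions` is hypothetical, and the choice of Python is an assumption, not something the draft or the UTC comments prescribe.

```python
import unicodedata


def report_unicode_versions() -> dict:
    """Return the Unicode versions known to this Python runtime.

    Hypothetical helper, for illustration only.
    """
    return {
        # Version of the Unicode Character Database used by the runtime.
        "runtime": unicodedata.unidata_version,
        # Frozen Unicode 3.2 database that Python keeps for
        # IDNA 2003 / stringprep processing.
        "idna_2003": unicodedata.ucd_3_2_0.unidata_version,
    }


if __name__ == "__main__":
    for name, version in report_unicode_versions().items():
        print(f"{name}: Unicode {version}")
```

A registry of libraries could collect exactly this kind of pair (library, Unicode version) for each implementation.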

There are, however, a number of areas of concern.
As a general issue, we'd urge closer cooperation between the IAB and the Unicode consortium on the document, so that the character encoding and software internationalization issues can be reviewed by experts in the field, and accurately represented in the document.


There are two layers, characters and languages, in constant confusion.

There is certainly an advantage in the IETF having better knowledge of both layers. For years the IETF has been lobbied by Unicode Members in this area. The result is confusion, as the RFC 3066 Bis episode and its aftermath show. I understand that the commercial nature of Unicode is a problem for the IETF. But the result, for the Unicode Members participating in the IETF, is a natural reflex of exclusion: they try to keep a coherent doctrine. This may create hurtful behaviours and ethical problems. This is why I would strongly advocate an MoU on character-related aspects between the IETF and Unicode. Such an MoU should clearly identify ISO as the reference, Unicode as a non-exclusive but leading source of expertise, and the final leadership of the users in the usage areas.

This would make it possible to differentiate the Internet architectural character layer, where Unicode is truly a reference, from the lingual layer, where there is total confusion. Internationalization (extending the character set with non-ASCII characters) is by nature inapplicable to multilingualisation (parallel support of mutually exclusive terminologies to express the same concepts). The result is cacophony, as in IDNA.

As RFC 3066 Bis shows, the Unicode/IETF doctrine and tools (libraries and CLDR) are unable at this stage to address dialects, sociolects and idiolects. This is quite worrying, since this may be an area of urgent standardisation effort and certainly of strong demand prompted by the possibilities of the Internet. This is certainly a significant MDRS concern, and we know we will actively use digital space names (not only Internet ones) to support them.

The chief area of concern is section 4.3.

4.3.  Combining Characters and Character Components

One thing that increases IDNA complexity and the need for
normalization is that combining characters are permitted.  Without
them, complexity might be reduced enough to permit more easy
transitions to new versions.  The community should consider whether
combining characters should be prohibited entirely from IDNs.  A
consequence of this, of course, is that each new language or script
would require that all of its characters have Unicode assignments to
specific, precomposed, code points, a model that the Unicode
Consortium has rejected for Roman-based scripts.  For non-Roman
scripts, it seems to be the Unicode trend to define such code points.
At some level, telling the users and proponents of scripts that, at
present, require composing characters to work the issues out with the
Unicode Consortium in a way that severely constrains the need for
those characters seems only appropriate.  The IAB and the IETF should
examine whether it is appropriate to press the Unicode Consortium to
revise these policies or otherwise to recommend actions that would
reduce the need for normalization and the related complexities.
The descriptions and recommendations in this section are simply not feasible. They do not recognize the fundamental importance of combining marks as an integral component of a great many scripts, nor do they recognize the fundamental need for compatibility that is required of the Unicode Standard. Asking for combining characters to be removed is akin to asking English vowels to be removed, and all possible syllables to be encoded instead. There are, as well, a number of purely factual errors. For example, "it seems to be the Unicode trend to define such code points" is simply incorrect. This section serves no purpose but to betray a basic lack of understanding of scripts; it needs to be removed entirely.
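
To make the point about combining marks concrete, here is a minimal sketch using Python's standard `unicodedata` module. The sample strings and the helper name `nfc_length` are illustrative assumptions, not taken from the draft. It shows that NFC composes a sequence only when a precomposed code point exists, and that many ordinary sequences have none.

```python
import unicodedata


def nfc_length(text: str) -> int:
    """Number of code points left after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))


# Illustrative samples: base character followed by one combining mark.
SAMPLES = {
    "e + combining acute": "e\u0301",           # a precomposed U+00E9 exists
    "q + combining tilde": "q\u0303",           # no precomposed form exists
    "Devanagari KA + vowel sign I": "\u0915\u093f",  # no precomposed form exists
}

if __name__ == "__main__":
    for label, text in SAMPLES.items():
        mark_category = unicodedata.category(text[-1])
        print(f"{label}: {len(text)} code points, "
              f"{nfc_length(text)} after NFC, mark category {mark_category}")
```

Prohibiting combining characters would rule out the second and third samples altogether, which is the practical consequence the comment above objects to.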

The worry I have here is the reference to Unicode as if it had the capacity to do this. It simply denotes a lack of understanding of the issue addressed. The world does not go by Unicode but by ISO. Even if Unicode and ISO are considered to be hand in glove, such a text shows the need for the MoU I advocate, at least to underline that ISO is the common reference. At least if the IETF wants to be international. But even if the IETF wanted to limit itself to the Internationalised US Internet, such a request would be a request for balkanization.

A second area of major concern is Section 2.2.3.

2.2.3.  Normalization and Character Mappings

Unicode contains several different models for representing
characters.  The Chinese (Han)-derived characters of the "CJK"
languages are "unified", i.e., characters with common derivation and
similar appearances are assigned to the same code point.  European
characters derived from a Greek-Roman base are separated into
separate code blocks for "Latin", Greek and Cyrillic even when
individual characters are identical in both form and semantics.
Separate code points based on font differences alone are generally
prohibited, but a large number of characters for "mathematical" use
have been assigned separate code points even though they differ from
base ASCII characters only by font attributes such as "script",
"bold", or "italic".  Some characters that often appear together are
treated as typographical digraphs with specific code points assigned
to the combination, others require that the two-character sequences
be used, and still others are available in both forms.  Some Roman-
based letters that were developed as decorated variations on the
basic Latin letter collection (e.g., by addition of diacritical
marks) are assigned code points as individual characters, others must
be built up as two (or more) character sequences using "composing
characters".
This section betrays a lack of understanding of the fundamental differences between Han characters and the scripts Latin, Greek, and Cyrillic.
Many of these differences result from the desire to maintain backward
compatibility while the standard evolved historically, and are hence
understandable.  However, the DNS requires precise knowledge of which
codes and code sequences represent the same character and which ones
do not.  Limiting the potential difficulties with confusable
characters (see Section 2.2.6) requires even more knowledge of which
characters might look alike in some fonts but not in others.  These
variations make it difficult or impossible to apply a single set of
rules to all of Unicode.  Instead, more or less complex mapping
tables, defined on a character by character basis, are required to
"normalize" different representations of the same character to a
single form so that matching is possible.
The Unicode consortium *does* supply a precise mechanism for determining when two strings represent the same underlying abstract characters. These do supply a single set of rules to all of Unicode, based on a set of data that is in the Unicode Character Database.
This paragraph also conflates the confusable issue with character equivalence. These are separate issues: there are a great many instances where characters are confusable where they are not at all equivalent (such as zero and the letter O).
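
The distinction between equivalence and confusability can be sketched with Python's standard `unicodedata` module, which implements the Unicode normalization forms. The function name `canonically_equivalent` and the sample pairs are assumptions chosen for illustration only.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True when both strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Illustrative pairs: (description, first string, second string).
PAIRS = [
    ("precomposed e-acute vs e + combining acute", "\u00e9", "e\u0301"),
    ("angstrom sign vs A with ring above", "\u212b", "\u00c5"),
    ("digit zero vs capital letter O", "0", "O"),
    ("Latin a vs Cyrillic a", "a", "\u0430"),
]

if __name__ == "__main__":
    for label, first, second in PAIRS:
        print(f"{label}: {canonically_equivalent(first, second)}")
```

The first two pairs are equivalent under one rule set applied to all of Unicode. The last two may look alike in many fonts, yet normalization leaves them distinct, so they have to be handled as a separate confusables problem.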
... The fact
that most or all scripts included in Unicode have been initially
incorporated by copying an existing standard more or less intact has
impact on the optimization of these algorithms and on forward
compatibility.  Even if the language is known and language-specific
rules can be defined, dependencies on the language do not disappear.
Any canonicalization operations that depend on more than short
sequences of text is not possible to do without context.  DNS lookups
and many other operations do not have a way to capture and utilize
the language or other information that would be needed to provide
that context.
First, it is neither "most" nor "all". Very few scripts, proportionately, have been incorporated by copying an existing standard. Second, "Any canonicalization operations that depend on more than short sequences of text is not possible to do without context...." is difficult to make sense of. One would have to explain the sense of "canonicalization" that is being discussed. It could be as trivial as "language-based canonicalization is impossible without language information", which is true, but above the document argues against using language-based equivalences on a global basis (and for very good reason!)

This is clearly the result of a layer violation: confusion by network architects between character and language issues. The solution is not to change ISO but to prevent the problem from existing on the network side.

===

Other areas of concern:

(more properly "Roman", see below)
The common modern practice in the naming of the script is to use the term "Latin", not "Roman". Whether or not one thinks that should not have been the case, insisting on older terms is pointless, and not germane to the purpose of the document.


+1
They were used bidirectionally in Latium before Rome was even born.
They are Etruscan.

When writing or typing the label (or word), a script must be selected
and a charset must be picked for use with that script.
This is confusing charset, keyboard and script. Saying "a script must be selected" is *neither* true from the user's perspective, nor does it at all match the implementation pipeline from keypress to storage of a label. What may have been confusing for the authors is that sometimes keyboards that are listed for selection are sorted by script; that does not, however, mean that a "script is selected".
The proper word, if more substantial changes are not made to the wording, would be "a keyboard must be selected". (Even that is quite odd, since it implies that that is done each time a user types a label.)

This is what happens in IDNs. Only so-called "IDN.IDN" names can be typed with a single keyboard.

If that charset, or the local charset being used by the relevant
operating system or application software, is not Unicode, a further
conversion must be performed to produce Unicode.  How often this is
an issue depends on estimates of how widely Unicode is deployed as
the native character set for hardware, operating systems, and
applications.  Those estimates differ widely, with some Unicode
advocates claiming that it is used in the vast majority of systems
and applications today.  Others are more skeptical, pointing out
that:

o  ISO 8859 versions [ISO.8859.1992] and even national variations of
   ISO 646 [ISO.646.1991] are still widely used in parts of Europe;
o  code-table switching methods, typically based on the techniques of
   ISO 2022 [ISO.2022.1986] are still in general use in many parts of
   the world, especially in Japan with Shift-JIS and its variations;
o  that computing, systems, and communications in China tend to use
   one or more of the national "GB" standards rather than native
   Unicode;
o  and so on.

Not all charsets define their characters in the same way and not all
pre-existing coding systems were incorporated into Unicode without
changes.  Sometimes local distinctions were made that Unicode does
not make or vice versa.  Consequently, conversion from other systems
to Unicode may potentially lose information.

Most of this section is unnecessary and the thrust of it is misleading. The only issue is "local distinctions" are lost when converting to Unicode; that doesn't happen when converting from any of the examples listed. This passage implies that there are significant problems in mapping to Unicode in doing IDN, and there simply aren't.
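
A round trip through Unicode is one simple way to check for information loss on a given piece of text. The sketch below uses Python's built-in codecs for three of the families named in the draft; the sample strings and the helper name `round_trips` are assumptions for illustration, and a handful of samples demonstrates the method rather than proving the general claim.

```python
# Illustrative samples: legacy charset name -> text representable in it.
SAMPLES = {
    "iso-8859-1": "Ærøskøbing",
    "shift_jis": "日本語",
    "gb2312": "中文",
}


def round_trips(text: str, charset: str) -> bool:
    """Check that legacy bytes survive conversion to Unicode and back."""
    legacy_bytes = text.encode(charset)        # data as stored in the legacy charset
    as_unicode = legacy_bytes.decode(charset)  # conversion to Unicode
    return as_unicode.encode(charset) == legacy_bytes


if __name__ == "__main__":
    for charset, text in SAMPLES.items():
        print(f"{charset}: round trip preserved = {round_trips(text, charset)}")
```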

This can only be documented by a complete registry covering all the charsets that actually exist and are in use. We cannot go by "Most"; we must go by "Every".

... Worse, one needs to be reasonably
familiar with a script and how it is used to understand how much
characters can reasonably vary as the result of artistic fonts and
typography.  For example, there are a few fonts for Latin characters
that are sufficiently highly ornamented that an observer might easily
confuse some of the characters with characters in Thai script.
The confusion of Latin with Thai is a red herring. It would take an exceedingly contrived scenario for it to present a problem. There are plenty of realistic scenarios involving confusables across, say, Latin and Cyrillic.
... IDNA
prohibits these mixed-directional (or bidirectional) strings in IDN
labels, but the prohibition causes other problems such as the
rejection of some otherwise linguistically and culturally sensible
strings.  As Unicode and conventions for handling so-called
bidirectional ("BIDI") strings evolve, the prohibition in IDNA should
be reviewed and reevaluated.
Deviating from the practices already built into IRI would be a mistake. As the document recognizes above, it cannot be a goal to represent all possible "linguistically and culturally sensible strings" in IDNs. The restrictions on BIDI are ones that have achieved broad consensus as the minimal ones to help avoid some fairly serious security issues.

This is a character/language layer violation. Mark presents a sound character-layer solution. It does not address all of the language issue. If there is a real problem at the language layer, that issue should be specified separately. This is the only way to have an operational service and perhaps, later on, to improve it. DNS does not support upper case. IDNA restricts BIDI. Maybe the solution found for mail names will help in finding a solution.

4.1.2.  Elimination of word-separation punctuation
... We might even
consider banning use of the hyphen itself in non-ASCII strings or,
less restrictively, strings that contained non-Roman characters.
This section is not well motivated. The authors need to justify why such characters represent a problem (and one of such a serious nature that hyphens should be disallowed).

Hyphen removal would remove the possibility of including langtags in a domain name to support multilingual versions of a site. Better to scrap RFC 3066 Bis then.

-----

* Section 2.2.3: "characters that are essentially identical will not match"
What is meant by "essentially identical"? Does this mean identical in appearance, identical in internal representation, identical in semantics, canonically equivalent (same NFC forms), or compatible equivalent (same NFKC forms)? The intent needs to be clarified, otherwise the statement is subject to misinterpretation.

+1
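
The senses of "identical" listed above can be told apart mechanically. The following is a minimal sketch with Python's standard `unicodedata` module; the function name `equivalence_level` and its return labels are hypothetical, chosen only to mirror the wording of the comment.

```python
import unicodedata


def equivalence_level(a: str, b: str) -> str:
    """Classify how two strings relate under Unicode normalization."""
    if a == b:
        return "identical code points"
    if unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b):
        return "canonically equivalent"
    if unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b):
        return "compatibility equivalent"
    return "not equivalent"


if __name__ == "__main__":
    examples = [
        ("\u00e9", "e\u0301"),     # precomposed vs combining sequence
        ("\ufb01", "fi"),          # fi ligature vs two letters
        ("\U0001d400", "A"),       # mathematical bold capital A vs ASCII A
        ("a", "\u0430"),           # Latin a vs Cyrillic a (look-alikes only)
    ]
    for first, second in examples:
        print(repr(first), repr(second), "->", equivalence_level(first, second))
```

Identity in appearance, as in the last pair, is not captured by any normalization form, which is why the intended sense needs to be stated.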

* Section 2.2.3: "This Unicode normalization process [does not account for] equivalences that are language or script dependent"
What is meant by "script-dependent equivalences"? Can you provide an example?

What are language equivalences for the DNS? USA.com and United-States-of-America.com are language-equivalent DNs. Does registering one forbid registering the other?

* Section 2.2.3: "U+00F8 [...] and U+00F6 [...] are considered to match in Swedish"
"Match" needs some clarification. In accordance with Swedish standards, when collating with Swedish locale, all major implementations match these characters at the first and second level, but not at a lower level. Thus they are not exact matches: this might be better phrased in terms of equivalence.
* Section 2.2.3: "Even if the language is known and language-specific rules can be defined, dependencies on the language do not disappear"
It is unclear what this means. Could you give an example?

+1

* Section 2.2.1: "Those characters are not treated as equivalent according to the Unicode consortium while...".
This is somewhat ad hominem. It should rather be "...according to the Unicode Standard while..."

ISO 10646 is the only reference which should be used. The word "Unicode" should not be used. This is like quoting WGs in an RFC.

* Section 2.2.1: "..confusion in Germany, where the U+00F8 character is never used in the language".
That is not true, there are entries with that character in the Duden dictionary.
* Section 2.2.4: "This is because [...] some glyphs [...] have been assigned different code points in Unicode".
This is incorrect: glyphs are not assigned to code points; characters are.
* Section 2.2.6: "Is the answer the same for words two [sic] different languages that translate into each other?".
This is completely orthogonal to IDNs (cf "Is 'cat' the same as 'gato' or the same as 'katze'?").

+1

* Section 2.2.7: "the IESG statement [...] that a registry should have a policy about the scripts, languages, code points and text directions".
This appears to not be an accurate paraphrase of (http://www.ietf.org/IESG/STATEMENTS/IDNstatement.txt). That document rather says a registry "MIGHT want to prevent particular characters", "MIGHT want to automatically generate a list of (...) strings and suggest that they also be registered" and lastly "it is suggested that a registry act conservatively". There is no such thing as "SHOULD" wording and, for instance, text direction is not mentioned.

Such a policy is of no real interest anyway, as IDNA does not impose that policy on further DN levels.

* Section 2.2.8: "This maybe [...] because many other applications are internally sensitive only to the appearance of characters and not to their representation".
This is reversed. The vast majority of applications are internally sensitive only to the representation, not to the appearance. Exceptions would be OCR, for example.

+1

* Section 2.2.8: "A change in a code point assignment (...) may be extremely disruptive".
This suggests that the consortium capriciously changes code points. After the merger with ISO 10646 there was only one point at which the Unicode consortium changed code points: Unicode 2.0.0 (July, 1996): The characters in the Korean Hangul block were moved to be part of a new, larger block with all 11,172 Hangul syllables.
As a result of the disruption that this caused, the Unicode Consortium and ISO/IEC SC2 resolved never to change code points in the future, and no changes have ever been done since.
* Section 3.1.1: "...such as code points assigned to font variations...".
Which characters are these referring to? Is it to just characters that are resolved by an NFKC normalization, or does it refer to others?
* Section 4.5: "the whois protocol itself (...) is ASCII-only".
This appears to be inaccurate. The Whois protocol (http://www.ietf.org/rfc/rfc3912.txt?number=3912) has no mechanisms to indicate which character encoding is being used, but the protocol is 8-bit clean and it is indeed used so by many (for instance, DENIC has a UTF-8 implementation up and running).

+1

To be noted: ML Domain Names are meant to be much closer to the people, and therefore to their local laws. Privacy regulations are better respected by banning the current Whois service.

As an addition to these remarks, I think that the solution to the problems discussed is a character/implementers/language debate to define a globally supported restriction of ISO 10646 that is acceptable for digital naming. Its purpose would not be to support any name (which IDNA does not do anyway) but to provide a secure (anti-phishing) threehexadecimal network name coding system.


jfc






