
Re: Impending publication: draft-iab-idn-nextsteps-05

2006-04-22 23:10:55
Let me see if I can clarify the situation.

Once a version of Unicode is issued, the consortium makes no retroactive changes. Thus if someone claims and correctly implements conformance to "Unicode Version X", their implementation will remain conformant to that version forever.

The corrigenda *can be* applied to previous versions, if someone wants to: but in that case someone would claim and implement conformance to "Unicode Version X *plus* Corrigendum Y".

On the larger issue of problems in http://www.ietf.org/internet-drafts/draft-iab-idn-nextsteps-05.txt, the consortium has communicated a number of problems in that document to the IAB and the authors, but has never gotten a response on any of those problems. While the issue appears to be moot, since the document is being issued without consideration of those flaws, I'll copy the comments here for those interested. The original comments were on http://www.iab.org/documents/drafts/draft-iab-idn-nextsteps-02.txt; but the text has not improved regarding these issues since that point.

=====

The UTC strongly supports many of the goals of the document, including especially improving the security of IDNs, and updating the version of Unicode used in NamePrep and StringPrep (since the old version of Unicode they require excludes or hampers many languages). There are, however, a number of areas of concern.

As a general issue, we'd urge closer cooperation between the IAB and the Unicode consortium on the document, so that the character encoding and software internationalization issues can be reviewed by experts in the field, and accurately represented in the document.

The chief area of concern is section 4.3.

4.3.  Combining Characters and Character Components

One thing that increases IDNA complexity and the need for
normalization is that combining characters are permitted.  Without
them, complexity might be reduced enough to permit more easy
transitions to new versions.  The community should consider whether
combining characters should be prohibited entirely from IDNs.  A
consequence of this, of course, is that each new language or script
would require that all of its characters have Unicode assignments to
specific, precomposed, code points, a model that the Unicode
Consortium has rejected for Roman-based scripts.  For non-Roman
scripts, it seems to be the Unicode trend to define such code points.
At some level, telling the users and proponents of scripts that, at
present, require composing characters to work the issues out with the
Unicode Consortium in a way that severely constrains the need for
those characters seems only appropriate.  The IAB and the IETF should
examine whether it is appropriate to press the Unicode Consortium to
revise these policies or otherwise to recommend actions that would
reduce the need for normalization and the related complexities.

The descriptions and recommendations in this section are simply not feasible. They do not recognize the fundamental importance of combining marks as an integral component of a great many scripts, nor do they recognize the fundamental need for compatibility that is required of the Unicode Standard. Asking for combining characters to be removed is akin to asking that English vowels be removed and all possible syllables be encoded instead. There are, as well, a number of purely factual errors. For example, "it seems to be the Unicode trend to define such code points" is simply incorrect. This section serves no purpose but to betray a basic lack of understanding of scripts; it needs to be removed entirely.
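To illustrate the point about combining marks, here is a minimal sketch using Python's standard `unicodedata` module: some combining-mark sequences compose to a precomposed code point under NFC, but many have no precomposed form at all, which is why "prohibiting combining characters entirely" cannot work.

```python
import unicodedata

# Precomposed U+00E9 and the sequence e + U+0301 COMBINING ACUTE ACCENT
# are canonically equivalent; NFC composes the sequence into one code point.
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"

# But many combining-mark sequences have no precomposed form at all:
# NFC leaves q + combining acute as a two-code-point sequence, and no
# precomposed character exists (or will be added) for it.
assert unicodedata.normalize("NFC", "q\u0301") == "q\u0301"
```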

A second area of major concern is Section 2.2.3.

2.2.3.  Normalization and Character Mappings

Unicode contains several different models for representing
characters.  The Chinese (Han)-derived characters of the "CJK"
languages are "unified", i.e., characters with common derivation and
similar appearances are assigned to the same code point.  European
characters derived from a Greek-Roman base are separated into
separate code blocks for "Latin", Greek and Cyrillic even when
individual characters are identical in both form and semantics.
Separate code points based on font differences alone are generally
prohibited, but a large number of characters for "mathematical" use
have been assigned separate code points even though they differ from
base ASCII characters only by font attributes such as "script",
"bold", or "italic".  Some characters that often appear together are
treated as typographical digraphs with specific code points assigned
to the combination, others require that the two-character sequences
be used, and still others are available in both forms.  Some Roman-
based letters that were developed as decorated variations on the
basic Latin letter collection (e.g., by addition of diacritical
marks) are assigned code points as individual characters, others must
be built up as two (or more) character sequences using "composing
characters".

This section betrays a lack of understanding of the fundamental differences between Han characters and the scripts Latin, Greek, and Cyrillic.

Many of these differences result from the desire to maintain backward
compatibility while the standard evolved historically, and are hence
understandable.  However, the DNS requires precise knowledge of which
codes and code sequences represent the same character and which ones
do not.  Limiting the potential difficulties with confusable
characters (see Section 2.2.6) requires even more knowledge of which
characters might look alike in some fonts but not in others.  These
variations make it difficult or impossible to apply a single set of
rules to all of Unicode.  Instead, more or less complex mapping
tables, defined on a character by character basis, are required to
"normalize" different representations of the same character to a
single form so that matching is possible.

The Unicode consortium *does* supply a precise mechanism for determining when two strings represent the same underlying abstract characters. This mechanism supplies a single set of rules applicable to all of Unicode, based on data in the Unicode Character Database.

This paragraph also conflates the confusable issue with character equivalence. These are separate issues: there are a great many instances where characters are confusable but not at all equivalent (such as zero and the letter O).
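The distinction is easy to demonstrate. Canonical equivalence is algorithmic and uniform across all of Unicode, while confusability is an entirely separate notion that normalization does not (and should not) address; a small Python sketch:

```python
import unicodedata

# Canonical equivalence: one rule set, driven by Unicode Character Database
# data, applies uniformly: A + COMBINING RING ABOVE composes to U+00C5.
assert unicodedata.normalize("NFC", "A\u030a") == "\u00c5"

# Confusability is a separate issue: Latin 'a' (U+0061) and Cyrillic 'а'
# (U+0430) look alike in many fonts but are not equivalent under any
# normalization form.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, "\u0061") != unicodedata.normalize(form, "\u0430")
```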

... The fact
that most or all scripts included in Unicode have been initially
incorporated by copying an existing standard more or less intact has
impact on the optimization of these algorithms and on forward
compatibility.  Even if the language is known and language-specific
rules can be defined, dependencies on the language do not disappear.
Any canonicalization operations that depend on more than short
sequences of text is not possible to do without context.  DNS lookups
and many other operations do not have a way to capture and utilize
the language or other information that would be needed to provide
that context.

First, it is neither "most" nor "all". Very few scripts, proportionately, have been incorporated by copying an existing standard. Second, "Any canonicalization operations that depend on more than short sequences of text is not possible to do without context...." is difficult to make sense of. One would have to explain the sense of "canonicalization" being discussed. It could be as trivial as "language-based canonicalization is impossible without language information", which is true, but above the document argues against using language-based equivalences on a global basis (and for very good reason!).

===

Other areas of concern:

(more properly "Roman", see below)

The common modern practice in the naming of the script is to use the term "Latin", not "Roman". Whether or not one thinks that should have been the case, insisting on older terms is pointless and not germane to the purpose of the document.

When writing or typing the label (or word), a script must be selected
and a charset must be picked for use with that script.

This confuses charset, keyboard, and script. Saying "a script must be selected" is *neither* true from the user's perspective, nor does it at all match the implementation pipeline from keypress to storage of a label. What may have confused the authors is that the keyboards listed for selection are sometimes sorted by script; that does not, however, mean that a "script is selected".

The proper word, if more substantial changes are not made to the wording, would be "a keyboard must be selected". (Even that is quite odd, since it implies that this is done each time a user types a label.)

If that charset, or the local charset being used by the relevant
operating system or application software, is not Unicode, a further
conversion must be performed to produce Unicode.  How often this is
an issue depends on estimates of how widely Unicode is deployed as
the native character set for hardware, operating systems, and
applications.  Those estimates differ widely, with some Unicode
advocates claiming that it is used in the vast majority of systems
and applications today.  Others are more skeptical, pointing out
that:

o  ISO 8859 versions [ISO.8859.1992] and even national variations of
   ISO 646 [ISO.646.1991] are still widely used in parts of Europe;
o  code-table switching methods, typically based on the techniques of
   ISO 2022 [ISO.2022.1986] are still in general use in many parts of
   the world, especially in Japan with Shift-JIS and its variations;
o  that computing, systems, and communications in China tend to use
   one or more of the national "GB" standards rather than native
   Unicode;
o  and so on.

Not all charsets define their characters in the same way and not all
pre-existing coding systems were incorporated into Unicode without
changes.  Sometimes local distinctions were made that Unicode does
not make or vice versa.  Consequently, conversion from other systems
to Unicode may potentially lose information.

Most of this section is unnecessary, and its thrust is misleading. The only issue is that "local distinctions" can be lost when converting to Unicode; that does not happen when converting from any of the examples listed. This passage implies that there are significant problems in mapping to Unicode when doing IDN, and there simply aren't.
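In particular, the charsets named in the draft's own bullet list round-trip through Unicode without loss: every character they encode has a Unicode code point. A quick Python check (the sample strings are illustrative):

```python
# Conversion from the legacy charsets listed above to Unicode is lossless;
# a round trip through Unicode recovers the original bytes exactly.
for text, encoding in [("søster", "iso-8859-1"),       # ISO 8859-1
                       ("\u3042\u308a", "shift_jis"),  # Japanese kana
                       ("\u4e2d\u6587", "gb2312")]:    # Chinese Hanzi
    data = text.encode(encoding)
    assert data.decode(encoding) == text
```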

... Worse, one needs to be reasonably
familiar with a script and how it is used to understand how much
characters can reasonably vary as the result of artistic fonts and
typography.  For example, there are a few fonts for Latin characters
that are sufficiently highly ornamented that an observer might easily
confuse some of the characters with characters in Thai script.

The confusion of Latin with Thai is a red herring. It would take an exceedingly contrived scenario for it to present a problem. There are plenty of realistic scenarios involving confusables across, say, Latin and Cyrillic.

... IDNA
prohibits these mixed-directional (or bidirectional) strings in IDN
labels, but the prohibition causes other problems such as the
rejection of some otherwise linguistically and culturally sensible
strings.  As Unicode and conventions for handling so-called
bidirectional ("BIDI") strings evolve, the prohibition in IDNA should
be reviewed and reevaluated.

Deviating from the practices already built into IRI would be a mistake. As the document recognizes above, it cannot be a goal to represent all possible "linguistically and culturally sensible strings" in IDNs. The restrictions on BIDI are ones that have achieved broad consensus as the minimal ones to help avoid some fairly serious security issues.
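For context, the restriction under discussion is small and mechanical. A rough sketch of the Nameprep/StringPrep bidi requirement (RFC 3454, section 6) in Python; the function name is mine and this simplifies the actual rules:

```python
import unicodedata

def violates_bidi_rule(label: str) -> bool:
    """Rough sketch of the StringPrep bidi restriction (RFC 3454, sec. 6):
    a label containing right-to-left characters (bidi categories R or AL)
    must not also contain left-to-right (L) characters, and must both
    begin and end with an R/AL character."""
    cats = [unicodedata.bidirectional(ch) for ch in label]
    has_rtl = any(c in ("R", "AL") for c in cats)
    if not has_rtl:
        return False
    if any(c == "L" for c in cats):
        return True  # mixed-direction label: prohibited
    return not (cats[0] in ("R", "AL") and cats[-1] in ("R", "AL"))

assert violates_bidi_rule("ab\u05d0")                      # Latin mixed with Hebrew alef
assert not violates_bidi_rule("\u05e9\u05dc\u05d5\u05dd")  # pure Hebrew label
```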

4.1.2.  Elimination of word-separation punctuation
... We might even
consider banning use of the hyphen itself in non-ASCII strings or,
less restrictively, strings that contained non-Roman characters.

This section is not well motivated. The authors need to justify why such characters represent a problem (and one of such a serious nature that hyphens should be disallowed).

-----

* Section 2.2.3: "characters that are essentially identical will not match"
What is meant by "essentially identical"? Does this mean identical in appearance, identical in internal representation, identical in semantics, canonically equivalent (same NFC forms), or compatible equivalent (same NFKC forms)? The intent needs to be clarified, otherwise the statement is subject to misinterpretation.
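The question matters because the candidate readings give different answers. A short Python sketch of the difference between canonical and compatibility equivalence:

```python
import unicodedata

# U+212B ANGSTROM SIGN is *canonically* equivalent to U+00C5:
# the two already match under NFC.
assert unicodedata.normalize("NFC", "\u212b") == "\u00c5"

# U+FB01 LATIN SMALL LIGATURE FI is only *compatibility* equivalent
# to "fi": NFC keeps the ligature, NFKC folds it.
assert unicodedata.normalize("NFC", "\ufb01") == "\ufb01"
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"
```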

* Section 2.2.3: "This Unicode normalization process [does not account for] equivalences that are language or script dependent" What is meant by "script-dependent equivalences"? Can you provide an example?

* Section 2.2.3: "U+00F8 [...] and U+00F6 [...] are considered to match in Swedish" "Match" needs some clarification. In accordance with Swedish standards, when collating with Swedish locale, all major implementations match these characters at the first and second level, but not at a lower level. Thus they are not exact matches: this might be better phrased in terms of equivalence.

* Section 2.2.3: "Even if the language is known and language-specific rules can be defined, dependencies on the language do not disappear"
It is unclear what this means. Could you give an example?

* Section 2.2.1: "Those characters are not treated as equivalent according to the Unicode consortium while...". This is somewhat ad hominem. It should rather be "...according to the Unicode Standard while..."

* Section 2.2.1: "..confusion in Germany, where the U+00F8 character is never used in the language". That is not true, there are entries with that character in the Duden dictionary.

* Section 2.2.4: "This is because [...] some glyphs [...] have been assigned different code points in Unicode".
This is incorrect: glyphs are not assigned to code points; characters are.

* Section 2.2.6: "Is the answer the same for words two [sic] different languages that translate into each other?". This is completely orthogonal to IDNs (cf "Is 'cat' the same as 'gato' or the same as 'katze'?").

* Section 2.2.7: "the IESG statement [...] that a registry should have a policy about the scripts, languages, code points and text directions". This appears not to be an accurate paraphrase of (http://www.ietf.org/IESG/STATEMENTS/IDNstatement.txt). That document rather says a registry "MIGHT want to prevent particular characters", "MIGHT want to automatically generate a list of (...) strings and suggest that they also be registered" and lastly "it is suggested that a registry act conservatively". There is no "SHOULD" wording, and, for instance, text direction is not mentioned.

* Section 2.2.8: "This maybe [...] because many other applications are internally sensitive only to the appearance of characters and not to their representation". This is reversed. The vast majority of applications are internally sensitive only to the representation, not to the appearance. Exceptions would be, for example, OCR.

* Section 2.2.8: "A change in a code point assignment (...) may be extremely disruptive". This suggests that the consortium capriciously changes code points. After the merger with ISO 10646 there was only one point at which the Unicode consortium changed code points: in Unicode 2.0.0 (July 1996), the characters in the Korean Hangul block were moved to be part of a new, larger block with all 11,172 Hangul syllables.

As a result of the disruption that this caused, the Unicode Consortium and ISO/IEC SC2 resolved never to change code points again, and no such change has been made since.

* Section 3.1.1: "...such as code points assigned to font variations...".
Which characters does this refer to? Is it just characters that are resolved by NFKC normalization, or does it refer to others?

* Section 4.5: "the whois protocol itself (...) is ASCII-only".
This appears to be inaccurate. The Whois protocol (http://www.ietf.org/rfc/rfc3912.txt?number=3912) has no mechanism to indicate which character encoding is being used, but the protocol is 8-bit clean, and many registries do use it that way (for instance, DENIC has a UTF-8 implementation up and running).
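Since RFC 3912 defines whois as a raw TCP exchange of bytes with no charset negotiation, nothing in the wire format prevents a client and server that agree on UTF-8 from exchanging non-ASCII labels. A small sketch (no live service is contacted; the domain is illustrative):

```python
# A UTF-8 whois query survives the RFC 3912 wire format unchanged:
# the payload is just bytes, with no escaping or charset marker.
query = "b\u00fccher.de\r\n".encode("utf-8")
assert query == b"b\xc3\xbccher.de\r\n"            # 8-bit bytes pass through
assert query.decode("utf-8") == "b\u00fccher.de\r\n"
```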

-----
(a later comment, on the 04 draft)

In my opinion (not speaking for the UTC here), not only were the issues from the UTC not addressed, if anything the document regressed. It now contains the following text:

 The IAB and the IETF should examine whether it is appropriate to
 press the Unicode Consortium to revise these policies or otherwise to
 recommend actions that would reduce the need for normalization and
 the related complexities.  However, participants in the Unicode
 Technical Committee have told us, on behalf of the Committee, that
 they would not consider adding the number of precomposed characters
 required to support existing languages and scripts, much less new
 ones.  So this option may not be feasible.  Retaining combining
 characters without further global restrictions may leave us "stuck"
 at Unicode 3.2, leading either to incompatibility differences in
 applications that otherwise use a modern version of Unicode (while
 IDN remains at Unicode 3.2) or to painful transitions to new
 versions.

This wording makes it seem as if it is just some whim of the UTC that prevents it from adding all possible combinations of sequences that would involve combining marks and then removing all combining marks. That is no whim; such a change would be a massive disruption for the encoding, somewhat akin to changing the encoding for English to be on the syllable level rather than individual letters. Moreover, none of the other issues raised by the UTC appear to have been even considered. Is there any way that we can set up a more effective method of communication?

=====

(I'll add one further comment to the above -- it is a complete non sequitur to imply that the IETF can't move beyond Unicode 3.2 because of combining marks.)

Mark

Jeffrey Hutzelman wrote:
On Fri, 21 Apr 2006, Simon Josefsson wrote:

On the other hand, potential security issues caused by instability in
the original (erroneous) definition are at least as serious as the
potential incompatibilities caused by the change.
Can you expand on this?

Can you give an example of how a system that contain components that
all properly implement StringPrep and the NFKC from Unicode 3.2, that
is without the PR-29 fix, have a serious security issues?

I haven't seen any such example.  I believe examples can be
constructed when one StringPrep implementation in a system implement
the original Unicode 3.2 NFKC semantics and one component implement
the "fixed" NFKC.  If you don't see it, I can try to produce a
complete scenario.

The problem is that the original NFKC isn't just "different"; it's
_unstable_.  That is, there exist strings for which you get different
results depending on _how many times_ you apply NFC or NFKC.  Systems
which use normalization to ensure that a security-related identifier is
always represented in the same way may break in the face of an unstable
normalization.

Of course, this only happens for the same highly-unlikely sequences...
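The stability property at issue can be stated as a simple idempotence check. A sketch in Python, using one of the sequences cited in Unicode PR-29 (the function name is mine; Python's `unicodedata` implements the corrected algorithm, so the check passes here):

```python
import unicodedata

def is_stable(form: str, s: str) -> bool:
    """Normalization is stable (idempotent) on s if applying it twice
    gives the same result as applying it once."""
    once = unicodedata.normalize(form, s)
    return unicodedata.normalize(form, once) == once

# A PR-29 problem sequence: a starter, a combining mark, then a
# second combining mark with combining class zero.
s = "\u0b47\u0300\u0b3e"
assert all(is_stable(form, s) for form in ("NFC", "NFKC"))
```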


    incompatibilities, despite that the change is not meant to modify
    how the version of Unicode that IDNA reference is implemented.

Except I still think you're trying to put words in the UTC's mouth that it
did not intend to say.  As I understand it, a corrigendum like this _is_
intended to modify the existing standard, including existing references.
It's roughly the equivalent of RFC errata - they're saying "there's a bug
in the text; it says X but we always meant for the spec to be Y".

Now, you can argue that the IETF should take the position that any of its
specs which refer to Unicode 3.2 and were published prior to corrigendum
#5 should be construed to mandate the use of the original, flawed
algorithms for NFC and NFKC.  But you seem to be taking the position that
the _Unicode Technical Committee_ had that intent with respect to existing
users of its standard, and I don't think that's true.

I also don't think the IETF should explicitly mandate the use of the
flawed algorithm by implementations of preexisting specs.  As it turns out,
one of the justifications for making the retroactive change to Unicode was
that a significant number of implementations, including whatever reference
or test implementation they used when writing the spec _and_ the sample
code which appears _in_ the spec, get it right.

-- Jeff


_______________________________________________
Ietf mailing list
Ietf(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/ietf


