Re: draft-newman-i18n-collation-09.txt just posted

The release of this is timely (we didn't get notified of an 07 or 08 draft), since the Unicode Technical Committee is meeting next week and can discuss it.

Could you indicate which of the items raised in the email of 2006-02-21 from the Unicode Technical Committee have been addressed in this release (and, for those not accepted, why not)? That would help greatly with the review. (I couldn't find any archive for discussion of draft-newman-i18n-comparator where that email could be publicly linked from, so I am appending it at the end of this message.) At a quick glance, it appears that a number of comments have been incorporated.


Mark

BTW, despite the subject of the message, the document is at http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-09.txt. It helps to send out a link, especially if the name (comparator vs collation) is wrong ;-)

BTW, it was pointed out to us that the original email shouldn't have been sent to "Network Working Group", even though that is the name at the top of http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-09.txt


Arnt Gulbrandsen wrote:

As far as I know, this addresses, ignores or adds open issues for all requests. If something is ignored, that's because other people wanted the opposite, or because I overlooked it when I went over all the mail this week. I'm sorry about it in either case.
Review, please.

Arnt


=================

Mark Davis wrote:

To:     Network Working Group
Re:     draft-newman-i18n-comparator
Date:   2006-02-21
From:   Unicode Technical Committee
The Unicode Technical Committee has reviewed the document http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-06.txt. While UTC is in favor of the goal, there are a number of problems with the document. The main problems are outlined below. Once these are addressed, then further review can continue.
    Details


      > 2.1 Definitions


        Content
The document needs to include the definitions of the technical terms used in the document, including all those that may not be familiar to implementers, such as "trichotomous" and "collation identifiers". In particular, the notion of a substring is /prima facie/ quite simple, but there are complications that require a clear definition. The text in the document does not make clear that there may be more than one match for a substring in a string, and that the matches can overlap. It says "the starting offset", for example, when there may be multiple ones.
Moreover, language sensitive matches have additional complications which need to be called out. For more information, see http://www.unicode.org/reports/tr10/#Searching
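
To illustrate why "the starting offset" is ambiguous, here is a minimal sketch in Python. It uses plain code point matching only (no language-sensitive rules), and the function name find_all_matches is made up for this example; it is not taken from the draft.

```python
def find_all_matches(haystack: str, needle: str) -> list:
    """Return every starting offset at which needle occurs, overlaps included."""
    if not needle:
        return []
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        # Advance by one position rather than len(needle),
        # so that overlapping matches are not skipped.
        start = haystack.find(needle, start + 1)
    return offsets


print(find_all_matches("abababa", "aba"))  # [0, 2, 4]
```

The single pattern "aba" has three starting offsets in "abababa", and the matches overlap, so a definition has to say which one is meant.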
        Format
If there is a "Definitions" section, readers have a reasonable expectation that that section should contain all the required definitions. However, a number of definitions are scattered within the text. One of two approaches should be taken:
   1. Move all the definitions into this section.
   2. Remove the definitions section, but clearly call out in the text
      the definition of each term on its own line.

Mixing these two styles is needlessly confusing for readers.


      > 2.4 Sort Keys
The use of the term "collation canonicalization" to refer to sort keys is very misleading. The term "canonicalization" implies that the results are still text in some fashion, whereas a sort key is simply a string of octets generated from a given string by a specific comparator, whereby the binary comparison (ordering) of two sort keys is guaranteed to match *that* comparator's compare function for the original strings. The octets may have no readily discernible relation to the original text. For example, the ICU sort keys generated for the following strings are:
cote      2c 44 4e 30 01 08 01 08 00
côté      2c 44 4e 30 01 85 93 85 8d 01 0a 00
Αραβικά   5c 20 52 20 22 36 3a 20 01 80 8d 01 8f 0b 00
See http://www-950.ibm.com/software/globalization/icu/demo/locales/en/?_=el&d_=en&x=col for other examples.
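
The defining property of a sort key can be sketched with a toy comparator. This is only an illustration: the comparator below is a simple caseless code point comparison, not UCA or ICU, and the names compare, sort_key and keys_agree are invented for the example.

```python
from itertools import product


def compare(a: str, b: str) -> int:
    """Toy caseless comparator: -1, 0 or 1 by code point order after case folding."""
    fa, fb = a.casefold(), b.casefold()
    return (fa > fb) - (fa < fb)


def sort_key(s: str) -> bytes:
    """Octet string whose binary order matches compare() above.

    UTF-8 preserves code point order, so comparing the encoded bytes
    gives the same result as comparing the case-folded strings.
    """
    return s.casefold().encode("utf-8")


def keys_agree(strings) -> bool:
    """Check that binary comparison of sort keys matches compare() for all pairs."""
    for a, b in product(strings, repeat=2):
        ka, kb = sort_key(a), sort_key(b)
        if (ka > kb) - (ka < kb) != compare(a, b):
            return False
    return True


print(keys_agree(["cote", "Cote", "côté", "Αραβικά"]))  # True
```

In this toy case the key still resembles the text. With a real collator, as in the ICU keys above, the octets are opaque, which is why "canonicalization" is the wrong word.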
> 3.2
This specifies that clients that support disconnected operation should not use wildcards, while clients that provide collation operations only when connected to the server may use wildcards.
It appears the restrictions may not really be needed, and they may need to be deleted from the draft. Otherwise, it would be really helpful if the rationale behind the restrictions were provided in the draft.
The EBNF syntax shown in section 3.2 says that the collation-wild must not exceed 255 characters total, while section 3.1 specifies that the collation name must not exceed 254 characters.
It seems that having the same maximum possible length for both the collation name and the wildcard string would be desirable for actual implementations.
      > 4.2.1 Equality
It needs to be made clear that the return values are not physically the strings "match", etc. but enumerated values such as /equal/ and /not_equal/. The document could describe a notation used for them, such as single quotes, since italic is not available in RFCs. Similarly, the results of the ordering function should be specified as an enumeration with three values: /less/, /equal/, /greater/. The mapping of actual API return values in implementations to these enumerated values can be outside of the scope of this document. For example, the mapping might take -1 onto /less/ in one implementation, or anything negative onto /less/ in another implementation.
One extremely important point is that for a given comparator, the equality function must be synchronized with the ordering function. That is, it must return 'equal' if and only if the ordering function returns 'equal'. Otherwise any coordinated usage of the functions will fail. This also implies that either 'error' is allowed for both functions or for neither.
The term 'error' is also problematic, since what is really at issue is a question of domain. For all those strings in the domain, either 'equal' or 'not_equal' should be returned from the equality function. For any string not in the domain, 'undefined' should be returned. That avoids coherency problems. Then the requirements are clear:
    * if A and B are in the domain, then the result of an equality
      test is either /equal/ or /not_equal/
    * if A or B (or both) are not in the domain, then the result of an
      equality test is /undefined/.
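
The two requirements above can be sketched as follows. The domain chosen here (ASCII-only strings) and the caseless comparison are assumptions made purely for illustration; the names Equality, in_domain and equality are not from the draft.

```python
from enum import Enum


class Equality(Enum):
    EQUAL = "equal"
    NOT_EQUAL = "not_equal"
    UNDEFINED = "undefined"


def in_domain(s: str) -> bool:
    """Hypothetical domain for this sketch: ASCII-only strings."""
    return s.isascii()


def equality(a: str, b: str) -> Equality:
    # If A or B (or both) are outside the domain, the result is 'undefined'.
    if not (in_domain(a) and in_domain(b)):
        return Equality.UNDEFINED
    # Inside the domain the result is always 'equal' or 'not_equal'.
    return Equality.EQUAL if a.lower() == b.lower() else Equality.NOT_EQUAL


print(equality("Mark", "mark"))   # Equality.EQUAL
print(equality("Mark", "Marc"))   # Equality.NOT_EQUAL
print(equality("côté", "cote"))   # Equality.UNDEFINED
```

The results are enumerated values, not the literal strings "match" or "no-match"; how an implementation represents them on the wire or in an API is a separate mapping.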
There is a typo on the 4th line of the second paragraph of section 4.2, which says "... For example, an collation" and should be changed to "... For example, a collation".
      > 4.2.2 Substring
Prefix and suffix matching are not fully spelled out. The operations and their results must be clarified. And as noted before, it is very important to precisely define the substring operations, especially the starting offset and ending offset. It also must be clarified whether what is being asked for is the first possible matching location in the string, the last, or the nth one.
      > 4.3.3 Ordering

> It MUST be transitive and trichotomous.
As above, these should be defined. The exposition in this section would be simpler if you also defined "reversible", whereby f(a,b) = less iff f(b,a) = greater. Then the statement would be:
    It MUST be transitive, trichotomous, and reversible.

>When the collation is used with a
   "-" prefix, the result of the ordering function of the collation MUST
   be reversed.

=> When the collation is used with a
   "-" prefix, the result of the ordering function of the collation
   when applied to two strings A and B MUST be the same as the
   result with a "+" prefix applied to B and A.
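
A minimal sketch of the suggested wording, assuming a toy caseless ordering function. The names Order, from_int, ordering and ordering_with_prefix are invented for the example, and from_int shows one possible mapping of integer API results onto the three enumerated values.

```python
from enum import Enum


class Order(Enum):
    LESS = "less"
    EQUAL = "equal"
    GREATER = "greater"


def from_int(n: int) -> Order:
    """Map a C-style result (any negative, zero, any positive) to the enumeration."""
    if n < 0:
        return Order.LESS
    if n > 0:
        return Order.GREATER
    return Order.EQUAL


def ordering(a: str, b: str) -> Order:
    """Toy caseless ordering function, standing in for a real collation."""
    fa, fb = a.casefold(), b.casefold()
    return from_int((fa > fb) - (fa < fb))


def ordering_with_prefix(prefix: str, a: str, b: str) -> Order:
    # With "-", the result for (A, B) is the "+" result for (B, A).
    if prefix == "-":
        return ordering(b, a)
    return ordering(a, b)


print(ordering_with_prefix("+", "apple", "Banana"))  # Order.LESS
print(ordering_with_prefix("-", "apple", "Banana"))  # Order.GREATER
print(ordering_with_prefix("-", "Mark", "mark"))     # Order.EQUAL
```

Defining the "-" prefix by swapping the operands makes it unambiguous that 'equal' stays 'equal', which "the result MUST be reversed" leaves open.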
An 'undefined' value can be allowed if, as per equality above, it means that at least one of the operands is outside of the domain. The function then imposes a total order on all strings in the domain; moreover, a wrapper can easily convert the function to a total order over all strings by putting all items outside the domain either below or above the ones in the domain, or even by excluding them, at its choice.
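
Such a wrapper might look like the sketch below. The ASCII-only domain, the caseless in-domain comparison, and the fallback to code point order among out-of-domain strings are all assumptions for illustration; the names are hypothetical.

```python
from functools import cmp_to_key


def in_domain(s: str) -> bool:
    """Hypothetical domain for this sketch: ASCII-only strings."""
    return s.isascii()


def domain_compare(a: str, b: str) -> int:
    """Toy caseless comparison, only meaningful for strings in the domain."""
    fa, fb = a.lower(), b.lower()
    return (fa > fb) - (fa < fb)


def total_compare(a: str, b: str, outside_last: bool = True) -> int:
    """Extend domain_compare to all strings by grouping out-of-domain items."""
    a_in, b_in = in_domain(a), in_domain(b)
    if a_in and b_in:
        return domain_compare(a, b)
    if a_in != b_in:
        # Exactly one operand is outside the domain.
        result = -1 if a_in else 1
        return result if outside_last else -result
    # Both outside the domain: fall back to raw code point order
    # (an arbitrary choice made by this wrapper).
    return (a > b) - (a < b)


words = ["pear", "Äpfel", "Apple", "éclair"]
print(sorted(words, key=cmp_to_key(total_compare)))
# ['Apple', 'pear', 'Äpfel', 'éclair']
```

The choice of where out-of-domain strings go belongs to the wrapper, not to the collation itself.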
> In general, collations SHOULD NOT return "0" unless the two strings are identical.
=> The ordering function MUST return 'equal' if and only if the equality function returns 'equal'.
[Note: it is very important to avoid the confusion between "identical" and "equal". According to a caseless compare, "Mark" and "mark" are equal; however, the strings are not identical.]
[Either 'ordering function' or 'comparison function' should be used consistently, not sometimes 'collations'].
      > 4.3.  Internal Canonicalization Algorithm
This section is difficult to understand. It appears that the goal is that any registration must specify sufficient detail, both data and algorithm, so as to enable someone to reproduce the results. But it is not at all clear that that is the goal. And that would make the registration require, in some cases, a huge accompanying document. To duplicate the results of CLDR collators, for example, would require the UCA specification, plus the LDML specification, plus all the relevant data in the CLDR repository.
      > 4.4.  Use of Lookup Tables

It is not at all clear what is meant by "customizable lookup tables".


      > 4.5.  Multi-Value Attributes
This is very unclear. It describes attributes as applying only to equality (since it only refers to "match" vs "no-match", and forgets "error").
This is a very important feature that needs to be spelled out in detail, and clearly reflected in the template for registration. In particular, the template should have provision for multiple attributes, with the ability to specify the acceptable operands for each attribute. (See below.) The specification of the operands could be either a list of values, or a regular expression (with the former preferred). Suggested regular expression syntax would be Perl or XML Schema.
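
As a rough sketch of what a machine-readable operand specification could enable, consider the following. The entry layout, the attribute values shown, and the function name operand_allowed are all hypothetical; the value list is illustrative and not the actual CLDR set, and Python's re syntax stands in for the Perl or XML Schema syntax suggested above.

```python
import re

# Hypothetical registration entry: each attribute lists its acceptable
# operands either as explicit values (preferred) or as a pattern.
ATTRIBUTES = {
    "strength": {"values": ["primary", "secondary", "tertiary"]},
    "variable-top": {"pattern": r".+"},
}


def operand_allowed(attribute: str, operand: str) -> bool:
    """Check an operand against the registered specification for an attribute."""
    spec = ATTRIBUTES.get(attribute)
    if spec is None:
        return False  # attribute not registered for this collation
    if "values" in spec:
        return operand in spec["values"]
    return re.fullmatch(spec["pattern"], operand) is not None


print(operand_allowed("strength", "primary"))  # True
print(operand_allowed("strength", "loud"))     # False
print(operand_allowed("variable-top", "$"))    # True
```

With the operands attached to the registered name in this way, a client or server can validate a request without every combination being registered separately.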
      > 5.1 Character Encoding

   The protocol specification has to make sure that it is clear on which
   characters (rather than just octets) the collations are used.  This
   can be done by specifying the protocol itself in terms of characters
   (e.g. in the case of a query language), by specifying a single
   character encoding for the protocol (e.g.  UTF-8 [3]), or by
   carefully describing the relevant issues of character encoding
   labeling and conversion.  In the later case, details to consider
   include how to handle unknown charsets, any charsets which are
   mandatory-to-implement, any issues with byte-order that might apply,
   and any transfer encodings which need to be supported.
If a collation is able to advertise itself as being able to handle, say, SJIS and UTF-8, then there should be a required description of a protocol for indicating that and for communicating which encodings are handled, and how it handles error conditions (such as a charset outside of those it can handle). Otherwise, it is difficult to understand how this paragraph would be applied in practice.
      > 5.3

Section 5.3 specifies:

    The protocol MUST specify how comparisons behave in the absence of
    explicit collation negotiation or when a collation of "*" is
    requested. The protocol MAY specify that the default collation
    used in such circumstances is sensitive to server configuration.

and section 3.2 specifies:

    ... If the wildcard string matches multiple collations, the server
    SHOULD select the collation with the broadest scope (preferably
    international scope), the most recent table versions and the
    greatest number of supported operations. A single wildcard
    character ("*") refers to the application protocol collation
    behavior that would occur if no explicit negotiation were used.

These appear inconsistent.


      7.5.  Example Initial Registry Summary
The sample registry would suffer a combinatorial explosion if parameters are not handled differently. For example, with CLDR collations, there can be hundreds of locales, six different strength settings; four different case-first settings; three different alternate settings, backwards settings, normalization settings, case level settings, hiragana settings, and numeric settings; plus a variable-top setting which has a string as an operand. Registering the combinations that people are allowed to use would be untenable.
http://www.unicode.org/draft/reports/tr35/tr35.html#Setting_Options
Instead, as remarked above, the allowable attribute values need to be associated with the registered name in a machine-readable form.
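
For a rough sense of scale, the counts given above can be multiplied out. This sketch rests on two assumptions: that "three different" applies to each of the six settings from alternate through numeric, and that "hundreds of locales" is taken as 100 for a conservative figure. The variable-top setting is left out because its operand is an arbitrary string.

```python
from math import prod

# Number of possible values per setting, as read from the paragraph above.
# Assumption: each of the last six settings has three possible values.
settings = {
    "strength": 6,
    "case-first": 4,
    "alternate": 3,
    "backwards": 3,
    "normalization": 3,
    "case level": 3,
    "hiragana": 3,
    "numeric": 3,
}

per_locale = prod(settings.values())
locales = 100  # assumed lower bound for "hundreds of locales"

print(per_locale)            # 17496
print(per_locale * locales)  # 1749600
```

Even under these conservative assumptions, that is well over a million distinct registrations before variable-top is considered.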
> 11.  Security Considerations
This is insufficient. It should at least point to the problems described in UCA and in http://www.unicode.org/reports/tr36/tr36-4.html (note that that document has been approved by the UTC and will be posted as an approved version soon.)
    General
One of the real problems with the IANA character registry is that the entries are underspecified. It quite often occurs that two vendors implement the same IANA charset conversion different ways, leading to significant interoperability problems and text corruption. See, for example, http://www.w3.org/Submission/japanese-xml/#ambiguity_of_yen.
We have the real concern that this registry could lead down the same path.

> collation, it has to say so
There are places where the text should be clarified, as to whether a MUST or SHOULD is implied; this is just an example.
> "comparator" vs "collator"

Either one term or the other should be used consistently.

> Unicode 3.2
Unicode 3.2 is obsolete; the reference versions for the Collation Registry should be Unicode 5.0 and UCA 5.0, since those will be approved and published by the time the Internet Application Protocol Collation Registry has completed its review and been approved.
Because of the use of NamePrep, it is probably the case that Unicode 3.2 also needs to be included, but it should be strongly recommended for usage only by protocols or systems dependent on NamePrep. Note that as of UCA 4.0 and beyond, the version number of UCA is guaranteed to be identical with the version number of Unicode that it is defined for.
> Versioning
This is tricky, and should be clarified. In many instances, it is sufficient to use an unversioned collator, such as simply "UCA". In other cases, there are requirements to use a specific version, or a version of at least X. This needs to be described.
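
The three cases could be distinguished along the following lines. This is only a sketch: the mode names "any", "exact" and "at_least" and the function names are invented here, and the draft would need to define its own syntax for expressing them.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as "5.0" into a comparable tuple (5, 0)."""
    return tuple(int(part) for part in text.split("."))


def satisfies(available: str, mode: str = "any", wanted: Optional[str] = None) -> bool:
    """Check whether the available collator version meets a request.

    mode "any"      : unversioned request, e.g. simply "UCA"
    mode "exact"    : a specific version is required
    mode "at_least" : a version of at least `wanted` is required
    """
    if mode == "any":
        return True
    if wanted is None:
        raise ValueError("a version is required for mode " + repr(mode))
    have, want = parse_version(available), parse_version(wanted)
    if mode == "exact":
        return have == want
    if mode == "at_least":
        return have >= want
    raise ValueError("unknown mode " + repr(mode))


print(satisfies("5.0"))                     # True
print(satisfies("5.0", "exact", "4.1"))     # False
print(satisfies("5.0", "at_least", "4.1"))  # True
```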