Re: draft-newman-i18n-collation-09.txt just posted


Mark Davis writes:

The release of this is timely (we didn't get notified of a 07 or 08 draft), since the Unicode Technical Committee is meeting next week, and can discuss it.
Could you indicate which of the items raised in the email of 2006-02-21 from the Unicode Technical Committee have been addressed in this release (and if not accepted then why)? That would help greatly with the review. (I couldn't find any archive for discussion of draft-newman-i18n-comparator where that email could be publicly linked from, so I am appending it at the end of this message.) At a quick glance, it appears that a number of comments have been incorporated.


Lots. Some not. See below.

It is possible that some of my changes don't satisfy you. I had conflicting requests for many things. Feel free to repeat, rephrase or add arguments.

Mark
BTW, despite the subject of the message, the document is at http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-09.txt. It helps to send out a link, especially if the name (comparator vs collation) is wrong ;-)


Mea culpa. My apologies.

...

To:   Network Working Group
Re:   draft-newman-i18n-comparator
Date:         2006-02-21
From:         Unicode Technical Committee
The Unicode Technical Committee has reviewed the document http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-06.txt. While UTC is in favor of the goal, there are a number of problems with the document. The main problems are outlined below. Once these are addressed, then further review can continue.
    Details

      > 2.1 Definitions

        Content
The document needs to include the definitions of the technical terms used in the document, including all those that may not be familiar to implementers, such as "trichotomous" and "collation identifiers". In particular, the notion of a substring is /prima facie/ quite simple, but there are complications that require a clear definition. The text in the document does not make clear that there may be more than one match for a substring in a string, and that the matches can overlap. It says "the starting offset", for example, when there may be multiple ones.


Changed.
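
To make the point about multiple and overlapping matches concrete, here is a minimal Python sketch. It assumes a simple ASCII caseless comparison, and the function name find_all_matches is hypothetical (it is not taken from the draft). It only illustrates why "the starting offset" is ambiguous.

```python
def find_all_matches(haystack: str, needle: str) -> list[int]:
    """Return every starting offset at which needle matches caselessly.

    Assumption: ASCII input, so lowercasing does not change string length.
    """
    h, n = haystack.lower(), needle.lower()
    if not n:
        return []
    # Check every candidate offset, so overlapping matches are all reported.
    return [i for i in range(len(h) - len(n) + 1) if h[i:i + len(n)] == n]


offsets = find_all_matches("Banana", "ANA")
print(offsets)  # [1, 3]: two matches that share the character at offset 3
```

A definition of the substring operation therefore has to say which of these offsets is returned.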

Moreover, language sensitive matches have additional complications which need to be called out. For more information, see http://www.unicode.org/reports/tr10/#Searching


Not really changed. As I recall, I added a little bit of text.

        Format
If there is a "Definitions" section, readers have a reasonable expectation that that section should contain all the required definitions. However, a number of definitions are scattered within the text. One of two approaches should be taken:
   1. Move all the definitions into this section.
   2. Remove the definitions section, but clearly call out in the text
      the definition of each term on its own line.

Mixing these two styles is needlessly confusing for readers.


Not changed; I'm going by what confuses reviewers.

      > 2.4 Sort Keys
The use of the term "collation canonicalization" to refer to sort keys is very misleading. ...

Changed; the text now speaks of sort keys. I'm afraid there still are instances of the old term around, I found one today.

> 3.2
This specifies that clients that support disconnected operation should not use wildcards while clients that provide collation operations only when connected to the server may use wildcards.


This restriction has been lifted.

The EBNF syntax shown in section 3.2 says that the collation-wild must not exceed 255 characters total while the section 3.1 specifies that the collation name must not exceed 254 characters.


Brought into sync.

It seems having the same maximum possible length for both collation name and wildcard string would be desirable for actual implementations.


I picked 254, not 255, but I confess I cannot remember why.

      > 4.2.1 Equality
It needs to be made clear that the return values are not physically the strings "match", etc. but enumerated values such as /equal/ and /not_equal/.


Changed. Also other similar changes.

One extremely important point is that for a given comparator, the equality function must be synchronized with the ordering function.

I've done this and all the other equivalences/connections/implications I could see.

The term 'error' is also problematic, since what is really at issue is a question of domain. For all those strings in the domain, either 'equal' or 'not_equal' should be returned from the equality function. For any string not in the domain, 'undefined' should be returned.

Not changed. Back in February, I agreed that "error" was not ideal, but did not see "undefined" as better, and could not find a really apt term. The collations were a little too well-defined in the "undefined" cases then.

However, in -10, I think they really will be undefined outside their domain, so I'll change to using "undefined" instead of "error". (I'm removing the bits that fall back to i;octet.)
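
The points in this section (enumerated return values, equality kept in sync with ordering, and 'undefined' outside the domain) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the toy collation is ASCII caseless, its domain is taken to be ASCII-only strings, and all names (Order, Equality, in_domain, ordering, equality) are hypothetical rather than taken from the draft.

```python
from enum import Enum


class Order(Enum):
    LESS = "less"
    EQUAL = "equal"
    GREATER = "greater"
    UNDEFINED = "undefined"


class Equality(Enum):
    EQUAL = "equal"
    NOT_EQUAL = "not_equal"
    UNDEFINED = "undefined"


def in_domain(s: str) -> bool:
    # Assumption: this toy collation is only defined for ASCII strings.
    return s.isascii()


def ordering(a: str, b: str) -> Order:
    # Outside the domain the result is 'undefined', not an error.
    if not (in_domain(a) and in_domain(b)):
        return Order.UNDEFINED
    ka, kb = a.lower(), b.lower()
    if ka < kb:
        return Order.LESS
    if ka > kb:
        return Order.GREATER
    return Order.EQUAL


def equality(a: str, b: str) -> Equality:
    # Derived from ordering(), so the two functions cannot disagree.
    result = ordering(a, b)
    if result is Order.UNDEFINED:
        return Equality.UNDEFINED
    return Equality.EQUAL if result is Order.EQUAL else Equality.NOT_EQUAL
```

Deriving equality from the ordering function is one simple way to guarantee the synchronization asked for above; a real collation could achieve the same by comparing sort keys in both functions.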

There is a typo at the 4th line of the second paragraph of the section 4.2 saying "... For example, an collation" which should be changed to "... For example, a collation" instead.


Fixed.

      > 4.2.2 Substring

Prefix and suffix matching are not fully spelled out.


I think they are now.

The operations and their results must be clarified. And as noted before, it is very important to precisely define the substring operations, especially the starting offset and ending offset. It also must be clarified whether what is being asked for is the first possible matching location in the string, the last, or the nth one.

Partly changed. I didn't do the bits you ask for in the last sentence. I can add an open issue.

      > 4.3.3 Ordering

> It MUST be transitive and trichotomous.

As above, these should be defined.

I did not, since I think this document is the wrong place to define these terms.

The exposition in this section would be simpler if you also defined "reversible", whereby f(a,b) = less iff f(b,a) = greater.

The exposition changed enough as a result of other comments that I disregarded this comment.
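
For reference, the "reversible" property can be spot-checked on sample data with a short Python sketch. The names (Order, caseless, is_reversible) are hypothetical and the caseless ordering is only a stand-in collation. Passing the check on a sample is evidence, not a proof, that the property holds.

```python
from enum import Enum
from itertools import product


class Order(Enum):
    LESS = "less"
    EQUAL = "equal"
    GREATER = "greater"


# What f(b, a) must return for each possible value of f(a, b).
FLIPPED = {
    Order.LESS: Order.GREATER,
    Order.GREATER: Order.LESS,
    Order.EQUAL: Order.EQUAL,
}


def caseless(a: str, b: str) -> Order:
    # Stand-in ordering function: ASCII caseless comparison.
    ka, kb = a.lower(), b.lower()
    if ka < kb:
        return Order.LESS
    if ka > kb:
        return Order.GREATER
    return Order.EQUAL


def is_reversible(f, samples) -> bool:
    # f(a, b) = less iff f(b, a) = greater, checked for every sample pair.
    return all(f(b, a) is FLIPPED[f(a, b)]
               for a, b in product(samples, repeat=2))


print(is_reversible(caseless, ["Mark", "mark", "Arnt"]))  # True
```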

An 'undefined' value can be allowed if, as per equality above, it means that at least one of the operands is outside of the domain. The function then imposes a total order on all strings in the domain; moreover, a wrapper can easily convert the function to a total order over all strings by putting all items outside the domain either below or above the ones in the domain -- or even excluding them, /at its choice/.


I'm doing something like this in -10. (Removing the fallback to i;octet.)
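
The wrapper idea can be sketched in Python as below. This is an illustration only: the domain (ASCII strings), the in-domain ordering (ASCII caseless) and the tie-break among out-of-domain strings (code point order) are all assumptions made for the example, and the function names are hypothetical.

```python
from functools import cmp_to_key


def in_domain(s: str) -> bool:
    # Assumption: the collation is only defined for ASCII strings.
    return s.isascii()


def domain_compare(a: str, b: str) -> int:
    # In-domain ordering: negative, zero or positive, like C's strcmp.
    ka, kb = a.lower(), b.lower()
    return (ka > kb) - (ka < kb)


def total_compare(a: str, b: str, outside_first: bool = False) -> int:
    a_in, b_in = in_domain(a), in_domain(b)
    if a_in and b_in:
        return domain_compare(a, b)
    if a_in != b_in:
        # Exactly one operand is outside the domain: place it below or
        # above all in-domain strings, at the wrapper's choice.
        a_sorts_first = (not a_in) if outside_first else a_in
        return -1 if a_sorts_first else 1
    # Both outside the domain: assumed tie-break by code point order.
    return (a > b) - (a < b)


words = ["pear", "Äpfel", "apple"]
print(sorted(words, key=cmp_to_key(total_compare)))
# ['apple', 'pear', 'Äpfel']
```

Excluding out-of-domain strings instead would amount to filtering with in_domain() before sorting.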

[Note: it is very important to avoid the confusion between "identical" and "equal". According to a caseless compare, "Mark" and "mark" are equal; however, the strings are not identical.]


Changed all over the place.
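
The distinction in the note is easy to show in Python. The helper name caseless_equal is hypothetical, and str.casefold() stands in for whatever caseless collation is actually meant.

```python
def caseless_equal(a: str, b: str) -> bool:
    # "Equal" under a caseless comparison.
    return a.casefold() == b.casefold()


a, b = "Mark", "mark"
print(caseless_equal(a, b))  # True: equal according to the collation
print(a == b)                # False: the strings are not identical
```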

[Either 'ordering function' or 'comparison function' should be used consistently, not sometimes 'collations'].


Changed.

      > 4.3.  Internal Canonicalization Algorithm

This section is difficult to understand.


Changed; I hope the new text is better.

      > 4.4.  Use of Lookup Tables

It is not at all clear what is meant by "customizable lookup tables".


Clarified and partly removed.

      > 4.5.  Multi-Value Attributes

This is very unclear.


Deleted.

This is a very important feature that needs to be spelled out in detail, and clearly reflected in the template for registration. In particular, the template should have provision for multiple attributes, with the ability to specify the acceptable operands for that attribute. (See below). The specification of the operands could be either a list of values, or a regular expression (with the former preferred). Suggested regular expression syntax would be Perl or XML Schema.

I asked Martin Dürst and you to provide a new DTD. Martin said okay, I don't remember whether you answered. I think the DTD should come before this.

      > 5.1 Character Encoding

   The protocol specification has to make sure that it is clear on which
   characters (rather than just octets) the collations are used.  This
   can be done by specifying the protocol itself in terms of characters
   (e.g. in the case of a query language), by specifying a single
   character encoding for the protocol (e.g.  UTF-8 [3]), or by
   carefully describing the relevant issues of character encoding
   labeling and conversion.  In the later case, details to consider
   include how to handle unknown charsets, any charsets which are
   mandatory-to-implement, any issues with byte-order that might apply,
   and any transfer encodings which need to be supported.

If a collation is able to advertise itself as being able to handle, say, SJIS and UTF-8, then there should be a required description of a protocol for indicating that and for communicating which encodings are handled, and how it handles error conditions (such as a charset outside of those it can handle). Otherwise, it is difficult to understand how this paragraph would be applied in practice.


      > 5.3

The section 5.3 specifies:

    The protocol MUST specify how comparisons behave in the absence of
    explicit collation negotiation or when a collation of "*" is
    requested. The protocol MAY specify that the default collation
    used in such circumstances is sensitive to server configuration.

and the section 3.2 specifies:

    ... If the wildcard string matches multiple collations, the server
    SHOULD select the collation with the broadest scope (preferably
    international scope), the most recent table versions and the
    greatest number of supported operations. A single wildcard
    character ("*") refers to the application protocol collation
    behavior that would occur if no explicit negotiation were used.

These appear inconsistent.


Changed.

      7.5.  Example Initial Registry Summary
The sample registry would suffer a combinatorial explosion if parameters are not handled differently.

...

This is the DTD issue.

> 11.  Security Considerations
This is insufficient. It should at least point to the problems related in UCA and in http://www.unicode.org/reports/tr36/tr36-4.html (note that that document has been approved by the UTC and will be posted as an approved version soon.)


It now refers.

    General
One of the real problems with the IANA character registry is that the entries are underspecified. It quite often occurs that two vendors implement the same IANA charset conversion different ways, leading to significant interoperability problems and text corruption. See, for example, http://www.w3.org/Submission/japanese-xml/#ambiguity_of_yen.
We have the real concern that this registry could lead down the same path.


Noted.

> collation, it has to say so
There are places where the text should be clarified, as to whether a MUST or SHOULD is implied; this is just an example.
> "comparator" vs "collator"

Either one term or the other should be used consistently.


Collator, now.

> Unicode 3.2
Unicode 3.2 is obsolete; the reference versions for the Collation Registry should be Unicode 5.0 and UCA 5.0, since those will be approved and published by the time the Internet Application Protocol Collation Registry has completed its review and been approved.

I'll update to the then-current versions immediately before submitting the final draft as an RFC.

Because of the use of NamePrep, it is probably the case that Unicode 3.2 also needs to be included, but strongly recommended for usage only by protocols or systems dependent on NamePrep. Note that as of UCA 4.0 and beyond, the version number of UCA is guaranteed to be identical with the version number of Unicode that it is defined for.
> Versioning
This is tricky, and should be clarified. In many instances, it is sufficient to use an unversioned collator, such as simply "UCA". In other cases, there are requirements to use a specific version, or a version of at least X. This needs to be described.

IETF documents should have only immutable references. Thus, I can reference "UCAv14", but not "UCA", because the latter moves to v15, v16 and onwards.


Arnt