Re: draft-newman-i18n-collation-09.txt just posted
2006-05-09 15:56:48
The release of this is timely (we didn't get notified of a 07 or 08
draft), since the Unicode Technical Committee is meeting next week, and
can discuss it.
Could you indicate which of the items raised in the email of 2006-02-21
from the Unicode Technical Committee have been addressed in this release
(and if not accepted then why)? That would help greatly with the review.
(I couldn't find any archive for discussion of
draft-newman-i18n-comparator where that email could be publicly linked
from, so I am appending it at the end of this message.) At a quick
glance, it appears that a number of comments have been incorporated.
Mark
BTW, despite the subject of the message, the document is at
http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-09.txt.
It helps to send out a link, especially if the name (comparator vs
collation) is wrong ;-)
BTW, it was pointed out to us that the original email shouldn't have
been sent to "Network Working Group", even though that is the name at
the top of
http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-09.txt
Arnt Gulbrandsen wrote:
As far as I know, this addresses, ignores or adds open issues for all
requests. If something is ignored, that's because other people wanted
the opposite, or because I overlooked it when I went over all the mail
this week. I'm sorry about it in either case.
Review, please.
Arnt
=================
Mark Davis wrote:
To: Network Working Group
Re: draft-newman-i18n-comparator
Date: 2006-02-21
From: Unicode Technical Committee
The Unicode Technical Committee has reviewed the document
http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-06.txt.
While UTC is in favor of the goal, there are a number of problems with
the document. The main problems are outlined below. Once these are
addressed, then further review can continue.
Details
> 2.1 Definitions
Content
The document needs to include the definitions of the technical terms
used in the document, including all those that may not be familiar to
implementers, such as "trichotomous" and "collation identifiers". In
particular, the notion of a substring is /prima facie/ quite simple,
but there are complications that require a clear definition. The text
in the document does not make clear that there may be more than one
match for a substring in a string, and that the matches can overlap.
It says "the starting offset", for example, when there may be multiple
ones.
Moreover, language sensitive matches have additional complications
which need to be called out. For more information, see
http://www.unicode.org/reports/tr10/#Searching
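To make the point about multiple and overlapping matches concrete, here is a minimal Python sketch. It uses plain code point matching rather than any registered collation, and the helper name find_all_matches is made up for this illustration.

```python
def find_all_matches(haystack: str, needle: str) -> list:
    """Return every starting offset at which needle occurs, overlaps included."""
    if not needle:
        return []
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        # Advance by one position, not by len(needle), so overlaps are kept.
        start = haystack.find(needle, start + 1)
    return offsets


matches = find_all_matches("abababa", "aba")
print(matches)  # [0, 2, 4]
```

With three candidate offsets, "the starting offset" is ambiguous unless the definition says which match is meant.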
Format
If there is a "Definitions" section, readers have a reasonable
expectation that that section should contain all the required
definitions. However, a number of definitions are scattered within the
text. One of two approaches should be taken:
1. Move all the definitions into this section.
2. Remove the definitions section, but clearly call out in the text
the definition of each term on its own line.
Mixing these two styles is needlessly confusing for readers.
> 2.4 Sort Keys
The use of the term "collation canonicalization" to refer to sort keys
is very misleading. The term "canonicalization" implies that the
results are still text in some fashion, whereas a sortkey is simply a
string of octets generated from a given string by a specific
comparator, whereby the binary comparison (ordering) of two sort keys
is guaranteed to match *that* comparator's compare function for the
original strings. The octets may have no readily discernable relation
to the original text. For example, the ICU sort keys generated for the
following strings are:
cote 2c 44 4e 30 01 08 01 08 00
côté 2c 44 4e 30 01 85 93 85 8d 01 0a 00
Αραβικά 5c 20 52 20 22 36 3a 20 01 80 8d 01 8f 0b 00
See
http://www-950.ibm.com/software/globalization/icu/demo/locales/en/?_=el&d_=en&x=col
for other examples.
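The sort key property described above can be sketched in Python with a toy caseless comparator. This is not ICU and not the draft's algorithm; the functions compare and sort_key are assumptions made up for this illustration. The only point is that bytewise comparison of the keys agrees with the comparator that produced them.

```python
def compare(a: str, b: str) -> int:
    """Toy caseless comparator: returns -1, 0, or 1."""
    fa, fb = a.casefold(), b.casefold()
    return (fa > fb) - (fa < fb)


def sort_key(s: str) -> bytes:
    """Octet string whose binary order matches compare() above."""
    # UTF-8 preserves code point order under bytewise comparison.
    return s.casefold().encode("utf-8")


words = ["cote", "Cote", "côté", "Mark", "mark"]
for a in words:
    for b in words:
        ka, kb = sort_key(a), sort_key(b)
        # Binary comparison of the keys must agree with compare().
        assert compare(a, b) == (ka > kb) - (ka < kb)
```

The keys are only meaningful relative to this one comparator; keys from a different comparator are not comparable with them.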
> 3.2
This specifies that clients that support disconnected operation should
not use wildcards while clients that provide collation operations only
when connected to the server may use wildcards.
It appears that the restrictions may not really be needed and could
be deleted from the draft. Otherwise, it would be really helpful if
the rationale behind the restrictions were provided in the draft.
The EBNF syntax shown in section 3.2 says that the collation-wild must
not exceed 255 characters total while the section 3.1 specifies that
the collation name must not exceed 254 characters.
It seems having the same maximum possible length for both collation
name and wildcard string would be desirable for actual implementations.
> 4.2.1 Equality
It needs to be made clear that the return values are not physically
the strings "match", etc. but enumerated values such as /equal/ and
/not_equal/. The document could describe a notation used for them,
such as single quotes, since italic is not available in RFCs.
Similarly, the results of the ordering function should be specified as
an enumeration with three values: /less/, /equal/, /greater/. The
mapping of actual API return values in implementations to these
enumerated values can be outside of the scope of this document. For
example, the mapping might take -1 onto /less/ in one implementation,
or anything negative onto /less/ in another implementation.
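As a rough illustration of keeping the enumerated results separate from API return values, here is a Python sketch. The two "implementations" are hypothetical, and the names Ordering, from_strict and from_sign are invented for this example.

```python
from enum import Enum


class Ordering(Enum):
    LESS = "less"
    EQUAL = "equal"
    GREATER = "greater"


def from_strict(value: int) -> Ordering:
    """Hypothetical implementation A: only -1, 0 and 1 are ever returned."""
    return {-1: Ordering.LESS, 0: Ordering.EQUAL, 1: Ordering.GREATER}[value]


def from_sign(value: int) -> Ordering:
    """Hypothetical implementation B: any negative or positive integer."""
    if value < 0:
        return Ordering.LESS
    if value > 0:
        return Ordering.GREATER
    return Ordering.EQUAL
```

Both mappings land on the same three enumerated values, which is all the specification would need to talk about.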
One extremely important point is that for a given comparator, the
equality function must be synchronized with the ordering function.
That is, it must return 'equal' if and only if the ordering function
returns 'equal'. Otherwise any coordinated usage of the functions will
fail. This also implies that either 'error' is allowed for both
functions or for neither.
The term 'error' is also problematic, since what is really at issue is
a question of domain. For all those strings in the domain, either
'equal' or 'not_equal' should be returned from the equality function.
For any string not in the domain, 'undefined' should be returned. That
avoids coherency problems. Then the requirements are clear:
* if A and B are in the domain, then the result of an equality
test is either /equal/ or /not_equal/
* if A or B (or both) are not in the domain, then the result of an
equality test is /undefined/.
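One way to guarantee that equality and ordering stay synchronized, including for out-of-domain operands, is to derive one from the other. The Python sketch below assumes a toy comparator whose domain is ASCII-only strings; that domain, and all the names used, are assumptions for illustration only.

```python
from enum import Enum


class Result(Enum):
    LESS = "less"
    EQUAL = "equal"
    GREATER = "greater"
    NOT_EQUAL = "not_equal"
    UNDEFINED = "undefined"


def in_domain(s: str) -> bool:
    """Assumed domain for this toy comparator: ASCII-only strings."""
    return s.isascii()


def ordering(a: str, b: str) -> Result:
    """Caseless ordering over the domain; 'undefined' outside it."""
    if not (in_domain(a) and in_domain(b)):
        return Result.UNDEFINED
    la, lb = a.lower(), b.lower()
    if la < lb:
        return Result.LESS
    if la > lb:
        return Result.GREATER
    return Result.EQUAL


def equality(a: str, b: str) -> Result:
    """Derived from ordering(), so the two functions cannot disagree."""
    result = ordering(a, b)
    if result in (Result.EQUAL, Result.UNDEFINED):
        return result
    return Result.NOT_EQUAL
```

Because equality() is a thin wrapper, it returns 'equal' exactly when ordering() does, and 'undefined' exactly when ordering() does.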
There is a typo on the 4th line of the second paragraph of
section 4.2: "... For example, an collation" should be changed to
"... For example, a collation".
> 4.2.2 Substring
Prefix and suffix matching are not fully spelled out. The operations
and their results must be clarified. And as noted before, it is very
important to precisely define the substring operations, especially the
starting offset and ending offset. It also must be clarified whether
what is being asked for is the first possible matching location in the
string, the last, or the nth one.
> 4.3.3 Ordering
> It MUST be transitive and trichotomous.
As above, these should be defined. The exposition in this section
would be simpler if you also defined "reversible", whereby f(a,b) =
less iff f(b,a) = greater. Then the statement would be:
It MUST be transitive, trichotomous, and reversible.
>When the collation is used with a
"-" prefix, the result of the ordering function of the collation MUST
be reversed.
=> When the collation is used with a
"-" prefix, the result of the ordering function of the collation
when applied to two strings A and B MUST
be the same as the result with a "+" prefix applied to B and A.
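The suggested wording amounts to swapping the operands. A minimal Python sketch, using code point order as a stand-in for a real collation (the function names are invented here):

```python
def ascending(a: str, b: str) -> str:
    """Toy "+" ordering by code point: 'less', 'equal' or 'greater'."""
    if a < b:
        return "less"
    if a > b:
        return "greater"
    return "equal"


def descending(a: str, b: str) -> str:
    """The "-" form: the same function applied to B and A."""
    return ascending(b, a)
```

Defined this way, 'equal' stays 'equal' under the "-" prefix and only 'less' and 'greater' trade places, which is what "reversible" requires.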
An 'undefined' value can be allowed if, as per equality above, it
means that at least one of the operands is outside of the domain. The
function then imposes a total order on all strings in the domain;
moreover, a wrapper can easily convert the function to a total order
over all strings by putting all items outside the domain either below
or above the ones in the domain -- or even excluding them, at its
choice.
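A wrapper of the kind mentioned above might look like the following Python sketch. The ASCII-only domain, the choice to sort out-of-domain strings above in-domain ones, and the fallback order among out-of-domain strings are all assumptions made for illustration.

```python
from functools import cmp_to_key


def in_domain(s: str) -> bool:
    """Assumed domain for illustration: ASCII-only strings."""
    return s.isascii()


def partial_compare(a: str, b: str):
    """Returns -1, 0, 1, or None ('undefined') outside the domain."""
    if not (in_domain(a) and in_domain(b)):
        return None
    return (a > b) - (a < b)


def total_compare(a: str, b: str) -> int:
    """Wrapper: out-of-domain strings sort above all in-domain strings."""
    a_in, b_in = in_domain(a), in_domain(b)
    if a_in and b_in:
        return partial_compare(a, b)
    if a_in != b_in:
        return -1 if a_in else 1
    # Both outside the domain: fall back to code point order (one choice).
    return (a > b) - (a < b)


items = ["pear", "côté", "apple", "Αραβικά"]
ordered = sorted(items, key=cmp_to_key(total_compare))
```

The underlying comparator stays simple, and the policy for out-of-domain strings lives entirely in the wrapper.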
> In general, collations SHOULD NOT return "0" unless the two strings
are identical.
=> The ordering function MUST return 'equal' if and only if the equality
function returns 'equal'
[Note: it is very important to avoid the confusion between "identical"
and "equal". According to a caseless compare, "Mark" and "mark" are
equal; however, the strings are not identical.]
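The distinction is easy to show in Python, using casefolding as a stand-in for a caseless comparator (caseless_equal is a name invented for this example):

```python
def caseless_equal(a: str, b: str) -> bool:
    """Equality under a toy caseless comparator."""
    return a.casefold() == b.casefold()


a, b = "Mark", "mark"
print(caseless_equal(a, b))  # True: equal under the comparator
print(a == b)                # False: the strings are not identical
```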
[Either 'ordering function' or 'comparison function' should be used
consistently, not sometimes 'collations'].
> 4.3. Internal Canonicalization Algorithm
This section is difficult to understand. It appears that the goal is that
any registration must specify sufficient detail, both data and
algorithm, so as to enable someone to reproduce the results. But it is
not at all clear that that is the goal. And that would make the
registration require, in some cases, a huge accompanying document. To
duplicate the results of CLDR collators, for example, would require
the UCA specification, plus the LDML specification, plus all the
relevant data in the CLDR repository.
> 4.4. Use of Lookup Tables
It is not at all clear what is meant by "customizable lookup tables".
> 4.5. Multi-Value Attributes
This is very unclear. It describes attributes as applying to only
equality (since it only refers to "match" vs "no-match" (and
forgetting "error")).
This is a very important feature that needs to be spelled out in
detail, and clearly reflected in the template for registration. In
particular, the template should have provision for multiple
attributes, with the ability to specify the acceptable operands for
that attribute. (See below). The specification of the operands could
be either a list of values, or a regular expression (with the former
preferred). Suggested regular expression syntax would be Perl or XML
Schema.
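As a sketch of what "attributes with acceptable operands" could look like in machine-readable form, consider the Python below. The structure, attribute names and operand values are all hypothetical; nothing here is taken from the draft or from any existing registry.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AttributeSpec:
    """One attribute of a registered collation (hypothetical structure)."""
    name: str
    allowed_values: tuple = ()  # preferred form: explicit list of operands
    pattern: str = ""           # alternative form: a regular expression

    def accepts(self, operand: str) -> bool:
        if self.allowed_values:
            return operand in self.allowed_values
        return re.fullmatch(self.pattern, operand) is not None


# Attribute names and operands below are invented for illustration.
STRENGTH = AttributeSpec("strength", allowed_values=("1", "2", "3", "4"))
VARIABLE_TOP = AttributeSpec("variable-top", pattern=r"[0-9a-f]{4,6}")
```

A client could then check a requested setting with, say, STRENGTH.accepts("2") before sending it, instead of relying on one registered name per combination.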
> 5.1 Character Encoding
The protocol specification has to make sure that it is clear on which
characters (rather than just octets) the collations are used. This
can be done by specifying the protocol itself in terms of characters
(e.g. in the case of a query language), by specifying a single
character encoding for the protocol (e.g. UTF-8 [3]), or by
carefully describing the relevant issues of character encoding
labeling and conversion. In the later case, details to consider
include how to handle unknown charsets, any charsets which are
mandatory-to-implement, any issues with byte-order that might apply,
and any transfer encodings which need to be supported.
If a collation is able to advertise itself as being able to handle,
say, SJIS and UTF-8, then there should be a required description of a
protocol for indicating that and for communicating which encodings are
handled, and how it handles error conditions (such as a charset
outside of those it can handle). Otherwise, it is difficult to
understand how this paragraph would be applied in practice.
> 5.3
The section 5.3 specifies:
The protocol MUST specify how comparisons behave in the absence of
explicit collation negotiation or when a collation of "*" is
requested. The protocol MAY specify that the default collation
used in such circumstances is sensitive to server configuration.
and the section 3.2 specifies:
... If the wildcard string matches multiple collations, the server
SHOULD select the collation with the broadest scope (preferably
international scope), the most recent table versions and the
greatest number of supported operations. A single wildcard
character ("*") refers to the application protocol collation
behavior that would occur if no explicit negotiation were used.
These appear inconsistent.
> 7.5. Example Initial Registry Summary
The sample registry would suffer a combinatorial explosion if
parameters are not handled differently. For example, with CLDR
collations, there can be hundreds of locales, six different strength
settings; four different case-first settings; three different
alternate settings, backwards settings, normalization settings, case
level settings, hiragana settings, and numeric settings; plus a
variable-top setting which has a string as an operand. Registering the
combinations that people are allowed to use would be untenable.
http://www.unicode.org/draft/reports/tr35/tr35.html#Setting_Options
Instead, as remarked above, the allowable attribute values need to be
associated with the registered name in a machine-readable form.
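To give a feel for the scale, the Python sketch below multiplies out the setting counts for a single locale. The counts are read from the paragraph above; treating "three" as applying to each of the six settings listed in that clause is an assumption, and the open-ended variable-top setting is left out entirely.

```python
from math import prod

# Number of values per setting. Applying "three" to each of the six
# settings from "alternate" through "numeric" is an assumption.
SETTING_COUNTS = {
    "strength": 6,
    "case-first": 4,
    "alternate": 3,
    "backwards": 3,
    "normalization": 3,
    "case-level": 3,
    "hiragana": 3,
    "numeric": 3,
}

per_locale = prod(SETTING_COUNTS.values())
print(per_locale)  # 17496
```

That figure is per locale, before multiplying by the number of locales and before any variable-top operand is considered.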
> 11. Security Considerations
This is insufficient. It should at least point to the problems related
in UCA and in http://www.unicode.org/reports/tr36/tr36-4.html (note
that that document has been approved by the UTC and will be posted as
an approved version soon.)
General
One of the real problems with the IANA character registry is that the
entries are underspecified. It quite often occurs that two vendors
implement the same IANA charset conversion different ways, leading to
significant interoperability problems and text corruption. See, for
example, http://www.w3.org/Submission/japanese-xml/#ambiguity_of_yen.
We have the real concern that this registry could lead down the same path.
> collation, it has to say so
There are places where the text should be clarified, as to whether a
MUST or SHOULD is implied; this is just an example.
> "comparator" vs "collator"
Either one term or the other should be used consistently.
> Unicode 3.2
Unicode 3.2 is obsolete; the reference versions for the Collation
Registry should be Unicode 5.0 and UCA 5.0, since those will be
approved and published by the time the Internet Application Protocol
Collation Registry has completed its review and been approved.
Because of the use of NamePrep, it is probably the case that Unicode
3.2 also needs to be included, but strongly recommended for usage only
by protocols or systems dependent on NamePrep. Note that as of UCA 4.0
and beyond, the version number of UCA is guaranteed to be identical
with the version number of Unicode that it is defined for.
> Versioning
This is tricky, and should be clarified. In many instances, it is
sufficient to use an unversioned collator, such as simply "UCA". In
other cases, there are requirements to use a specific version, or a
version of at least X. This needs to be described.
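The three cases (unversioned, exact version, at least X) could be expressed along the following lines. The requirement notation ('', '=X', '>=X') is purely hypothetical and is not proposed syntax for the draft.

```python
def parse_version(text: str) -> tuple:
    """Turn '5.0' into (5, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies(available: str, requirement: str) -> bool:
    """requirement is '', '=X' or '>=X' (hypothetical notation).

    Assumes both sides use the same number of version components.
    """
    if requirement == "":
        return True  # unversioned request: any version will do
    if requirement.startswith(">="):
        return parse_version(available) >= parse_version(requirement[2:])
    if requirement.startswith("="):
        return parse_version(available) == parse_version(requirement[1:])
    raise ValueError(f"unrecognized requirement: {requirement!r}")
```

For example, satisfies("5.0", ">=4.0") holds, while satisfies("3.2", ">=4.0") does not.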