Re: Caseless and accentless string comparisons


On Mon, 12 May 2003, SADAHIRO Tomoyuki wrote:

On Mon, 12 May 2003 07:46:37 -0400 (EDT)
Jungshik Shin <jshin(_at_)mailaps(_dot_)org> wrote:

  You meant NFD, didn't you?  BTW, the proposed update of UTS #10
(http://www.unicode.org/reports/tr10/tr10-10.html) may be of interest
as well. Note that it is still a draft and as such needs some refining
(for instance, its Hangul Jamo handling is not satisfactory). Is there
any Perl module that implements Unicode collation as described in UTS #10,
or the collation algorithm specified in ISO 14651-to-be (as it stands)?


Unicode::Collate is for UCA by UTS #10.


  Thanks for writing and maintaining it.

If I understood it correctly,
Trailing Weights would not make a Hangul syllable
(LVT, LV, LLVT, etc., with or without a mark) a single collation grapheme,
as long as each of L, V, and T is non-ignorable
and UCA lacks a protocol that allows a sequence of multiple
non-ignorable CEs to be regarded as one collation grapheme.


  The easiest and cleanest way to deal with Hangul Jamos is to
preprocess them through additional normalization (which I wrote about
the other day and which you implemented) and to assign primary weights
to _only_ basic jamos. That is,

  1. remove all cluster Jamos from allkeys.txt [1]
  2. assign primary weights to _basic_ Jamos in such a way that
     L > V > T for all basic L's, V's and T's.
  3. decompose all cluster Jamos to sequences of basic Jamos
  4. at every syllable boundary, insert
     the syllable terminator with a primary weight
     smaller than any weight given to T.
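
To make the effect of steps 2 to 4 concrete, here is a rough sketch in
Python (not Perl, and not how Unicode::Collate does it). Everything in
it is an assumption made for illustration: the names CLUSTER_TO_BASIC,
TERMINATOR, BASES and sort_key are made up, the cluster table is
deliberately incomplete, the weights are arbitrary numbers chosen only
so that L > V > T > terminator, and the syllable boundary test is the
simplest one I could think of. Only primary weights are modelled.

```python
import unicodedata

# Hypothetical and incomplete: cluster jamo -> sequence of basic jamos (step 3).
CLUSTER_TO_BASIC = {
    "\u1101": "\u1100\u1100",  # choseong SSANGKIYEOK  -> KIYEOK KIYEOK
    "\u116A": "\u1169\u1161",  # jungseong WA          -> O A
    "\u11A9": "\u11A8\u11A8",  # jongseong SSANGKIYEOK -> KIYEOK KIYEOK
    "\u11AA": "\u11A8\u11BA",  # jongseong KIYEOK-SIOS -> KIYEOK SIOS
}

TERMINATOR = 1                    # step 4: smaller than every T weight
BASES = (0x300, 0x200, 0x100)     # step 2: made-up bases for L, V, T (L > V > T)
STARTS = (0x1100, 0x1160, 0x11A8) # first code point of the L, V, T ranges
OTHER_BASE = 0x1000               # arbitrary: non-jamo characters sort last


def jamo_class(ch):
    """Return 0 for L, 1 for V, 2 for T, or None for a non-jamo character."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return 0
    if 0x1160 <= cp <= 0x11A7:
        return 1
    if 0x11A8 <= cp <= 0x11FF:
        return 2
    return None


def sort_key(text):
    """Build a tuple of primary weights for text (illustrative only)."""
    # NFD turns precomposed syllables into conjoining jamos.
    decomposed = unicodedata.normalize("NFD", text)
    basic = "".join(CLUSTER_TO_BASIC.get(ch, ch) for ch in decomposed)
    key = []
    prev = None  # jamo class of the previous character, if it was a jamo
    for ch in basic:
        cls = jamo_class(ch)
        # Simplified boundary rule: a syllable ends when a jamo is followed
        # by a non-jamo or by a jamo of an earlier class (e.g. T then L).
        if prev is not None and (cls is None or cls < prev):
            key.append(TERMINATOR)
        if cls is None:
            key.append(OTHER_BASE + ord(ch))
        else:
            key.append(BASES[cls] + ord(ch) - STARTS[cls])
        prev = cls
    if prev is not None:
        key.append(TERMINATOR)  # close the last syllable
    return tuple(key)


words = ["각", "가나", "과", "고", "가"]
print(sorted(words, key=sort_key))
```

The terminator is what keeps "가나" in front of "각": after the shared
L and V, the terminator is compared with the T of "각" and wins because
it is lighter. Without it, the L of "나" would be compared with that T
instead, and since L > T the order would come out wrong.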

Step 1 is in a sense similar to the reordering of some Thai/Lao letter
sequences in that it's outside the now frozen Unicode normalization but
simplifies (and is in fact required for) the collation.


As for the fact that Jamo sequences that are graphemes are not
collation graphemes, I think that's the price we have to pay for the
multi-level representability of the Korean script. To meet the needs of
most ordinary Koreans, who regard syllables as units (when they use '..'
in an RE, they expect it to match two syllables instead of two Jamos),
UCA and Unicode RE can be tailored to a restricted repertoire to which no
Jamo belongs.  However, in some other cases (e.g. linguistic research,
intelligent search engines), 'Jamos' (especially basic Jamos) are the
units of the Korean script. On Korean cell phone keypads, we go even
further down and decompose all vowels into sequences of 'dot',
'horizontal bar' and 'vertical bar'.
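
The '..' point is easy to see in any regex engine whose '.' matches one
code point. The following is only an illustration using Python's re
module (an assumption on my part, the same test in Perl may behave
differently depending on version and flags); the helper name two_dots is
made up.

```python
import re
import unicodedata

word = "한글"
nfc = unicodedata.normalize("NFC", word)  # two precomposed syllables
nfd = unicodedata.normalize("NFD", word)  # six conjoining jamos


def two_dots(s):
    """Return what '..' matches at the start of s, or None."""
    m = re.match("..", s)
    return m.group() if m else None


print(len(nfc), len(nfd))
print(two_dots(nfc) == nfc)  # both syllables are matched
print(len(two_dots(nfd)))    # only two jamos of the first syllable
```

With the precomposed form, '..' covers the two syllables most users
expect; with the decomposed form it stops after the first two jamos.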


Jungshik


 [1] These jamos should not have been encoded in Unicode at all.
     Encoding them was a mistake. Removing them
     even from the compatibility (de)composition, instead of upgrading
     the compatibility (de)composition to a canonical (de)composition,
     is another blunder.