perl-unicode

Re: How to use Unicode::Collate in multilinguage apps?

2004-03-28 02:30:05
I think that, for a script that usually represents a single language,
allkeys.txt defines a fairly acceptable collation order.
For example, the order of hiragana and katakana approximately
follows the custom of the Japanese language.

In contrast, for a script representing many languages
(say, the Latin script), tailoring is often necessary.

E.g. 'Ä' is sorted as A-umlaut (sometimes as 'AE') in German,
and as one of the additional letters ordered after 'Z' in some
northern European languages.

Yup, that is the case in Finnish and Swedish, and Danish and Norwegian
do similar things with their "a" and "o" equivalents.  This means it is
logically impossible to sort a list containing both German and Swedish
names "right". Many European languages sort some consonant+h after the
base consonant as a separate "letter", and so forth.  And I believe many
of the CJK languages in fact have several (and differing) customary
sorting orders.
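
To make the German/Swedish conflict concrete, here is a minimal Python
sketch. The two key functions and the name list are made up for
illustration; they are crude stand-ins for real tailorings, not what
Unicode::Collate or allkeys.txt actually does.

```python
import unicodedata

# Hypothetical, heavily simplified tailorings for illustration only.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"
GERMAN_FOLD = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})

def german_key(word):
    # German-style: an umlauted vowel sorts together with its base vowel.
    word = unicodedata.normalize("NFC", word).lower()
    return word.translate(GERMAN_FOLD)

def swedish_key(word):
    # Swedish-style: å, ä, ö are separate letters placed after z.
    # Only handles letters listed in SWEDISH_ALPHABET.
    word = unicodedata.normalize("NFC", word).lower()
    return [SWEDISH_ALPHABET.index(ch) for ch in word]

names = ["Zorn", "Ärlig", "Adler", "Öberg", "Otto"]

german_order = sorted(names, key=german_key)
swedish_order = sorted(names, key=swedish_key)

print(german_order)   # Ärlig lands next to Adler, Öberg next to Otto
print(swedish_order)  # Ärlig and Öberg land after Zorn
```

The same five names come out in two different orders, and neither order
is wrong; it depends on which language's reader you are sorting for.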

Even when staying within a single language one must decide whether to
do things like "dictionary sorting" (spaces etc. removed), how
lowercase and uppercase sort (A < B < a, A < a < B, a < A < B, or
a == A < B), what to do with things like articles, etc.
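
As a rough illustration of those choices, the following Python sketch
builds one sort key per case policy plus a simple "dictionary" key. All
function names are invented for this example, and the keys only cover
plain letters; they are not a substitute for a real collation library.

```python
def upper_first_key(s):
    # A < a < B: compare caselessly, then uppercase wins ties.
    return (s.lower(), [ch.islower() for ch in s])

def lower_first_key(s):
    # a < A < B: compare caselessly, then lowercase wins ties.
    return (s.lower(), [ch.isupper() for ch in s])

def caseless_key(s):
    # a == A < B: case is ignored entirely.
    return s.lower()

def dictionary_key(s):
    # "Dictionary sorting": drop spaces and punctuation, ignore case.
    return "".join(ch for ch in s if ch.isalnum()).lower()

letters = ["a", "B", "A", "b"]

by_codepoint = sorted(letters)                         # A < B < a
by_upper_first = sorted(letters, key=upper_first_key)  # A < a < B
by_lower_first = sorted(letters, key=lower_first_key)  # a < A < B
by_caseless = sorted(letters, key=caseless_key)        # a == A < B

surnames = ["de la Cruz", "Delacroix", "De Witt"]
by_dictionary = sorted(surnames, key=dictionary_key)
```

With the caseless key, "a" and "A" compare equal, so their relative
order is simply whatever it was in the input (Python's sort is stable).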

So one must always either accept "a good enough" sorting, or one must
customize more or less heavily.

But according to the Unicode default collation, 'Ä' is ordered
as a modified 'A' and is equal to 'A' at the primary level.
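
What "equal at the primary level" means can be approximated with a
two-level key in Python. This is only a sketch that assumes base letters
stand in for primary weights and combining marks for secondary weights;
the function names are invented, and the real algorithm uses the weights
from allkeys.txt rather than this decomposition trick.

```python
import unicodedata

def primary_key(s):
    # Approximate primary level: base letters only, marks and case ignored.
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.lower()

def secondary_key(s):
    # Approximate secondary level: the combining marks, in order.
    decomposed = unicodedata.normalize("NFD", s)
    return [ord(ch) for ch in decomposed if unicodedata.combining(ch)]

def default_like_key(s):
    # The secondary level is consulted only when the primary level ties.
    return (primary_key(s), secondary_key(s))

words = ["Äpfel", "Zebra", "Apfel", "Ball"]
default_like_order = sorted(words, key=default_like_key)
print(default_like_order)
```

Here "Äpfel" and "Apfel" tie on the first component of the key, so the
umlaut only decides which of the two comes first; both stay ahead of
"Ball".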


-- 
Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen