perl-unicode

Re: Unicode::Collate question

2003-12-01 11:30:05
OK, this is in line with how I understood this paragraph in perluniintro:

The short answer is that by default, Perl compares strings ("lt", "le", "cmp", "ge", "gt") based only on the code points of the characters. In the above case, the answer is "after", since 0x00C1 > 0x00C0.

So is it just by chance that these French words are accurately sorted?

I think a "qualified yes" here is in order...

% perl -Mutf8 -e 'binmode(STDOUT, ":utf8"); print join " ", sort qw(côte côté cote coté)'
cote coté côte côté

Is this the famous French "backwards accents" rule in action?
(http://www-clips.imag.fr/geta/gilles.serasset/tri-du-francais.html)
(no, I don't speak French)

But in this case, with those particular words, I think the ISO Latin 1
code point order (none of the characters are beyond ISO Latin 1) just
"happens" to work right: o < ô, and e < é.
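
To make the coincidence concrete, here is a minimal sketch that sorts the same four words three ways: with the built-in sort (code points), with Unicode::Collate's default settings, and with Unicode::Collate's "backwards => 2" option, which compares accent (level 2) weights from the end of the string. The extra word pair (zèbre, éclair) is not from the thread; it is an assumed example added only to show a case where code point order does not work out. The outputs in the comments are what I would expect from the module's default collation table, so treat them as illustrative.

```perl
use strict;
use warnings;
use utf8;                              # this source contains UTF-8 literals
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

my @words = qw(côté cote côte coté);

# Built-in sort compares code points only: o (0x6F) < ô (0xF4), e (0x65) < é (0xE9).
my @by_codepoint = sort @words;

# Unicode::Collate with defaults: accents compared left to right.
my @by_uca = Unicode::Collate->new->sort(@words);

# French-style ordering: accent differences are compared from the end.
my @by_french = Unicode::Collate->new(backwards => 2)->sort(@words);

print "code point: @by_codepoint\n";   # cote coté côte côté
print "collate:    @by_uca\n";         # cote coté côte côté
print "backwards:  @by_french\n";      # cote côte coté côté

# Hypothetical pair where code point order goes wrong: é (0xE9) > z (0x7A).
my @mixed         = qw(zèbre éclair);
my @mixed_cp      = sort @mixed;                            # zèbre éclair
my @mixed_collate = Unicode::Collate->new->sort(@mixed);    # éclair zèbre

print "code point: @mixed_cp\n";
print "collate:    @mixed_collate\n";
```

So for these four words the one-liner's output matches the default collation order, but it is not the "backwards accents" order, and the code point sort stops agreeing with the collator as soon as an accented letter has to be ordered against a different base letter.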

Some more links (database related since they have had to think about these things for years already) that hopefully explain some of the problems related to "linguistic sorting":

http://www.engin.umich.edu/caen/wls/software/oracle/server.901/a90236/ch4.htm
http://developer.mimer.com/documentation/html_92/Mimer_SQL_Engine_DocSet/Mimer_Concepts14.html


Thanks,
--
Eric Cholet


--
Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> http://www.iki.fi/jhi/
"There is this special biologist word we use for 'stable'.  It is 'dead'." -- Jack Cohen

