perl-unicode

Re: Unicode::Collate question

2003-12-01 09:30:07
On 1 Dec 03, at 16:46, Jarkko Hietaniemi wrote:

Thank you both for your replies. What about sorting words in one
particular language: is Perl's sort() good enough? I'm wondering, since
language isn't one of sort()'s arguments.

First we need to define "good enough"... again, if you are sorting
"simple" English or Hawaiian, you are probably fine.  But as soon
as your "words" contain real-life complications like

        - letters like é or ö or æ or ...
        - beyond-Latin-1 letters like Ă or Ł or Б or א or अ or ぁ or ...
        - people's names
        - acronyms and the like
        - do all the characters matter or just the letters
        - sorting mixed letters and digits
        - Roman numbers

you are on your own. For the first item, the locale pragma can help as
long as your data is 8-bit and in one locale. As soon as the data becomes
Unicode, Perl will, as far as I know, ignore the locale when sorting.

If you find yourself wanting more complex sorting, look on CPAN: search
search.cpan.org for "sort". For example, Sort::ArbBiLex might be useful.

OK, this is in line with how I understood this paragraph in perluniintro:

The short answer is that by default, Perl compares strings ("lt", "le",
"cmp", "ge", "gt") based only on the code points of the characters. In
the above case, the answer is "after", since 0x00C1 > 0x00C0.
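
As a small illustration of that paragraph (a sketch only, not taken from perluniintro; the variable names are made up for the example), comparing the two characters it mentions comes down to comparing their code points:

```perl
use strict;
use warnings;

# U+00C1 is LATIN CAPITAL LETTER A WITH ACUTE,
# U+00C0 is LATIN CAPITAL LETTER A WITH GRAVE.
my $a_acute = "\x{C1}";
my $a_grave = "\x{C0}";

# Without "use locale", cmp looks only at code points,
# so the result is positive ("after") because 0xC1 > 0xC0.
my $order = $a_acute cmp $a_grave;
print $order > 0 ? "after\n" : "before\n";
```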

So is it just by chance that these French words are accurately sorted?

% perl -Mutf8 -e 'binmode(STDOUT, ":utf8"); print join " ", sort qw(côte côté cote coté)'
cote coté côte côté
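
For comparison, here is a minimal sketch that sorts the same four words three ways: with plain sort(), with Unicode::Collate's default ordering, and with Unicode::Collate's `backwards => 2` option, which its documentation describes as the reversed accent ordering used for French. The variable names are only for illustration, and this assumes a Unicode::Collate installation that can find its collation key table.

```perl
use strict;
use warnings;
use utf8;                 # the source contains literal accented letters
use Unicode::Collate;

binmode(STDOUT, ":utf8");

my @words = qw(côte côté cote coté);

# Plain sort(): code point order, as in the one-liner above.
my @by_codepoint = sort @words;

# Default collation: accent differences are compared from the start of the word.
my @uca = Unicode::Collate->new->sort(@words);

# Level 2 (accents) compared backwards, i.e. from the end of the word.
my @french = Unicode::Collate->new(backwards => 2)->sort(@words);

print join(" ", @by_codepoint), "\n";
print join(" ", @uca), "\n";
print join(" ", @french), "\n";
```

For these particular words the first two orderings should coincide (cote coté côte côté), while the backwards accent comparison should give cote côte coté côté.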

Thanks,
--
Eric Cholet
