perl-unicode

Re: Unicode::Collate question

2003-12-06 01:30:05

Has anyone had a look at the OpenI18N/ICU locale data?

The locales there are all UTF-8 and have Java rule-based collation data, so
they *might* be useful for creating a more comprehensive (and accurate) set
of sort modules? The downside is that this data is pretty rough ATM, but it
does seem to be improving slowly.

I guess p6 is going to use ICU as the basis for I18N - sure hope the APIs
are easier though :)

The syntax of collation customization (tailoring) in ICU
 ( http://oss.software.ibm.com/icu/userguide/Collate_Customization.html )
is character-based and may be more intuitive:

   for French:
       "[backwards 2]&A << \u00e6/e <<< \u00c6/E"

   for Spanish:
       "&N < n\u0303 <<< N\u0303"
       "&C < ch <<< Ch <<< CH"
       "&l < ll <<< Ll <<< LL"

However, Unicode::Collate also allows linguistic tailoring.
Admittedly, its interface requires hard-coding the weights and
may be less user-friendly.
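
To make the hard-coded weights easier to read, here is a minimal sketch that
splits one entry line into its fields. Each collation element has the shape
[.primary.secondary.tertiary.fourth]; in allkeys-4.0.0.txt the fourth field is
the code point. The helper name parse_entry is hypothetical and is not part of
Unicode::Collate; it only illustrates the line format used in the script below.

```perl
use strict;
use warnings;

# Hypothetical helper (not provided by Unicode::Collate): split one line
# of an "entry" table into its code points and collation elements.
sub parse_entry {
    my ($line) = @_;
    $line =~ s/\s*#.*//;                            # drop the trailing comment
    my ($chars, $elems) = split /\s*;\s*/, $line, 2;
    my @cps = map { hex($_) } split ' ', $chars;    # e.g. (0x63, 0x68) for "ch"
    my @ces;
    # Each element looks like [.PPPP.SSSS.TTTT.QQQQ]; a leading "*"
    # instead of "." marks a variable element in allkeys.txt.
    while ($elems =~ /\[([.*])([0-9A-Fa-f]+)\.([0-9A-Fa-f]+)\.([0-9A-Fa-f]+)\.([0-9A-Fa-f]+)\]/g) {
        push @ces, {
            variable  => ($1 eq '*' ? 1 : 0),
            primary   => hex($2),   # base-letter level
            secondary => hex($3),   # accent level
            tertiary  => hex($4),   # case level
            fourth    => hex($5),   # code point in allkeys-4.0.0.txt
        };
    }
    return { chars => \@cps, elements => \@ces };
}

my $ch = parse_entry('0063 0068 ; [.0E6A.0020.0002.0063] # ch');
my $Ch = parse_entry('0043 0068 ; [.0E6A.0020.0007.0043] # Ch');

# Same primary weight, different tertiary weight: "ch" and "Ch" are one
# letter for sorting purposes and differ only by case.
printf "primary %04X, tertiary %04X vs %04X\n",
    $ch->{elements}[0]{primary},
    $ch->{elements}[0]{tertiary},
    $Ch->{elements}[0]{tertiary};
```

Read this way, the Spanish table gives ch, Ch and CH one shared primary weight
and distinguishes them only at the tertiary level, which corresponds to the
ICU rule "&C < ch <<< Ch <<< CH".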

#!perl
use strict;
use warnings;
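use utf8; # assumption: this file is saved as UTF-8 (the word lists contain literal non-ASCII characters)
use Unicode::Collate;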

our (@listEs, @listFr);

my $objEs = Unicode::Collate->new(
    entry => <<'ENTRY', # for allkeys-4.0.0.txt
0063 0068 ; [.0E6A.0020.0002.0063] # ch
0043 0068 ; [.0E6A.0020.0007.0043] # Ch
0043 0048 ; [.0E6A.0020.0008.0043] # CH
006C 006C ; [.0F4C.0020.0002.006C] # ll
004C 006C ; [.0F4C.0020.0007.004C] # Ll
004C 004C ; [.0F4C.0020.0008.004C] # LL
006E 0303 ; [.0F69.0020.0002.006E] # ñ
004E 0303 ; [.0F69.0020.0008.004E] # Ñ
ENTRY

#    entry => <<'ENTRY', # for allkeys-3.1.1.txt
#0063 0068 ; [.0A46.0020.0002.0063] # ch
#0043 0068 ; [.0A46.0020.0007.0043] # Ch
#0043 0048 ; [.0A46.0020.0008.0043] # CH
#006C 006C ; [.0B1C.0020.0002.006C] # ll
#004C 006C ; [.0B1C.0020.0007.004C] # Ll
#004C 004C ; [.0B1C.0020.0008.004C] # LL
#006E 0303 ; [.0B38.0020.0002.006E] # ñ
#004E 0303 ; [.0B38.0020.0008.004E] # Ñ
#ENTRY
);


my $objFr = Unicode::Collate->new(
    backwards => 2,

    entry => <<'ENTRY', # for allkeys-4.0.0.txt
00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae
00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE
ENTRY

#    entry => <<'ENTRY', # for allkeys-3.1.1.txt
#00E6 ; [.0A15.0020.0002.00E6][.0A65.0020.0002.00E6] # ae
#00C6 ; [.0A15.0020.0008.00C6][.0A65.0020.0008.00C6] # AE
#ENTRY
);

BEGIN {

@listEs = qw(
    cambio camelo camella camello Camerún cielo curso
    chico chile Chile CHILE chocolate
    espacio espanto español esperanza lama líquido luz
    llama Llama LLAMA llamar nos nueve ñu ojo
);

@listFr = (
  qw(
    cadurcien cæcum cÆCUM CæCUM CÆCUM caennais cæsium cafard
    coercitif cote côte Côte coté Coté côté Côté coter
    élève élevé gène gêne MÂCON maçon
    pèche PÈCHE pêche PÊCHE péché PÉCHÉ pécher pêcher
    relève relevé révèle révélé
    surélévation sûrement suréminent sûreté
    vice-consul vicennal vice-président vice-roi vicésimal),
  "vice versa", "vice-versa",
);

use Test;
plan tests => $#listEs + 2 + $#listFr + 2;

}

sub randomize { my %hash; @hash{@_} = (); keys %hash; } # ?!

for (my $i = 0; $i < $#listEs; $i++) {
    ok($objEs->lt($listEs[$i], $listEs[$i+1]));
}

for (my $i = 0; $i < $#listFr; $i++) {
    ok($objFr->lt($listFr[$i], $listFr[$i+1]));
}

our @randEs = randomize(@listEs);
our @sortEs = $objEs->sort(@randEs);

ok("@randEs" ne "@listEs");
ok("@sortEs" eq "@listEs");

our @randFr = randomize(@listFr);
our @sortFr = $objFr->sort(@randFr);

ok("@randFr" ne "@listFr");
ok("@sortFr" eq "@listFr");

__END__

Regards,
SADAHIRO Tomoyuki
