Re: Sort::UCA 0.04 - Unicode Collation Algorithm


On Mon, 13 Aug 2001 10:07:42 -0500
Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> wrote:

On Mon, Aug 13, 2001 at 10:35:32PM +0900, SADAHIRO Tomoyuki wrote:


Hello, everyone.

Sort::UCA 0.04 has been uploaded to CPAN.

snip

To Do: conformance tests of Unicode 3.1.1 Beta
 (at present it's DRAFT).


When it passes I will probably grab it into the 5.8.0-to-be


The result on Perl v5.7.2:
  cf. http://www.unicode.org/Public/BETA/UCA/CollationTest.html

Failed Test  Stat Wstat  Total   Fail  Failed  List of Failed
-------------------------------------------------------------------------------
t/CT_NI.t                95940  28146  29.34%  67795-95940
t/CT_S.t                 95940  28146  29.34%  67795-95940
Failed 2/2 test scripts, 0.00% okay. 56292/191880 subtests failed, 70.66% okay.

These failures are due to a miscalculation of the weights
for unassigned characters (cf. 7.1.2 Legal code points, UTR #10)
in the CollationTest files.
I've reported it to errata(_at_)unicode(_dot_)org.
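
For readers who do not have UTR #10 at hand, here is a minimal sketch of how
a code point with no entry in the collation table gets a pair of derived
collation elements. The function names are hypothetical and not part of
Sort::UCA. The base value 0xFBC0 and the 0020/0002 secondary and tertiary
weights are those of later UCA versions and are only an assumption here;
the 3.1.1 draft may use different constants.

```perl
use strict;
use warnings;

# Derive the primary weights of the two implicit collation elements
# for a code point that is not listed in the collation table.
# ASSUMPTION: the default base 0xFBC0 is the value used by later UCA
# versions for unassigned code points; the draft may differ.
sub implicit_weights {
    my ($cp, $base) = @_;
    $base = 0xFBC0 unless defined $base;
    my $aaaa = $base + ($cp >> 15);          # high bits go into the first element
    my $bbbb = ($cp & 0x7FFF) | 0x8000;      # low 15 bits go into the second
    return ($aaaa, $bbbb);
}

# Render the pair in the usual [.primary.secondary.tertiary.] notation.
# ASSUMPTION: 0020/0002 are the weights of later UCA versions.
sub format_elements {
    my ($aaaa, $bbbb) = @_;
    return sprintf('[.%04X.0020.0002.][.%04X.0000.0000.]', $aaaa, $bbbb);
}

# U+0378 is an unassigned code point.
print format_elements(implicit_weights(0x0378)), "\n";
```

If the test files and an implementation disagree on these constants, every
key involving an unassigned character will differ, which would match a
contiguous run of failures at the end of a sorted test file.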

This is the script used for the above test.
(This is ONLY for your information; NOT a patch to perl.)

##BEGIN##
diff -urN dummy/CT_NI.t t/CT_NI.t
--- dummy/CT_NI.t       Thu Jan 01 09:00:00 1970
+++ t/CT_NI.t   Wed Aug 15 18:55:38 2001
@@ -0,0 +1,28 @@
+use strict;
+use Test;
+use warnings;
+use Sort::UCA 0.05;
+
+BEGIN { plan tests => 95940 }
+
+open FH, "<CollationTest_NON_IGNORABLE.txt" or die $!;
+my $UCA = Sort::UCA->new( alternate => "non-ignorable" ) or die $@;
+
+my $preKey  = "";
+my $preUTF8 = "";
+
+while(<FH>){
+    my($stdKey);
+    chomp;
+    s/(\[.*\])// and $stdKey = $1;
+    my $r = $_;
+    s/;.*//;
+    my @u        = Sort::UCA::_getHexArray($_);
+    my $curUTF8  = pack('U*', @u);
+    my $curKey   = $UCA->viewSortKey($curUTF8);
+    my $expect   = $curKey ne $preKey;
+    my $result   = $UCA->cmp($curUTF8, $preUTF8);
+    $preKey      = $curKey;
+    $preUTF8     = $curUTF8;
+    ok($result == $expect && $curKey eq $stdKey);
+}
diff -urN dummy/CT_S.t t/CT_S.t
--- dummy/CT_S.t        Thu Jan 01 09:00:00 1970
+++ t/CT_S.t    Wed Aug 15 18:51:44 2001
@@ -0,0 +1,28 @@
+use strict;
+use warnings;
+use Test;
+use Sort::UCA 0.05;
+
+BEGIN { plan tests => 95940 }
+
+open FH, "<CollationTest_SHIFTED.txt" or die $!;
+my $UCA = Sort::UCA->new(  ) or die $@;
+
+my $preKey  = "";
+my $preUTF8 = "";
+
+while(<FH>){
+    my($stdKey);
+    chomp;
+    s/(\[.*\])// and $stdKey = $1;
+    my $r = $_;
+    s/;.*//;
+    my @u        = Sort::UCA::_getHexArray($_);
+    my $curUTF8  = pack('U*', @u);
+    my $curKey   = $UCA->viewSortKey($curUTF8);
+    my $expect   = $curKey ne $preKey;
+    my $result   = $UCA->cmp($curUTF8, $preUTF8);
+    $preKey      = $curKey;
+    $preUTF8     = $curUTF8;
+    ok($result == $expect && $curKey eq $stdKey);
+}
##END##
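
The per-line parsing done in both scripts can be tried on its own, without
Sort::UCA installed. The sketch below uses core Perl only. The helper name
parse_test_line is hypothetical, the line layout is inferred from the
substitutions in the scripts above rather than from the file's
specification, and the sample line and its key are made up.

```perl
use strict;
use warnings;

# Split one test line into the string under test and the expected key.
# ASSUMPTION: layout is "HEX HEX ...; comment [key]", as the regexes in
# the test scripts suggest.
sub parse_test_line {
    my ($line) = @_;
    chomp $line;
    my $std_key;
    $line =~ s/(\[.*\])// and $std_key = $1;    # pull out the bracketed key
    $line =~ s/;.*//;                            # drop everything from ';' on
    my @cps = map { hex($_) } split ' ', $line;  # hex fields -> code points
    return (pack('U*', @cps), $std_key);
}

# Made-up sample line; the key is NOT a real sort key.
my ($str, $key) = parse_test_line("0041 0062; # sample [0001 0002 | 0003]\n");
printf "%d characters, expected key %s\n", length($str), $key;
```

In the real scripts the string is then passed to viewSortKey and cmp, and
each line is compared with the previous one, since the test files list
their entries in collation order.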

(how about Unicode::Sort as the name?)


Unicode::Sort is also good, but
Unicode::Collate might be more consistent with Unicode::Normalize.

 cf. Unicode Normalization Forms => Unicode::Normalize
     Unicode Collation Algorithm => Unicode::Collate

Regards, SADAHIRO Tomoyuki