perl-unicode

Unicode Collation Algorithm

2001-08-03 20:10:56
Hello, everyone.

Would a Perl implementation of the Unicode Collation Algorithm
look something like the following?

(1) SYNOPSIS

Collate strings using the Unicode Collation Algorithm (UTR #10);
  see http://www.unicode.org/unicode/reports/tr10/

use Sort::UCA; # or other name, ex. UCA, Unicode::Collation, etc. 

#construct
$uca = Sort::UCA->new(%arguments);

#sort
@sorted = $uca->sort(@not_sorted);

#compare
$result = $uca->cmp($a, $b); # returns 1, 0, or -1. 

# other methods: ex. getSortKey, cmpSortKey, etc.
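
To make the interface concrete, here is a minimal sketch of how sort and cmp
could both be layered on top of getSortKey. The package name My::ToyCollator
and its trivial key (case-folded text plus the original text as a tie-breaker)
are assumptions for illustration only; a real implementation would build its
keys from the collation element table.

```perl
use strict;
use warnings;

package My::ToyCollator;   # hypothetical name, not the proposed module

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# Toy sort key: case-folded text first, original text as tie-breaker.
# A real key would be built from collation element weights.
sub getSortKey {
    my ($self, $str) = @_;
    return lc($str) . "\0" . $str;
}

# Compare two strings by comparing their sort keys.
sub cmp {
    my ($self, $x, $y) = @_;
    return $self->getSortKey($x) cmp $self->getSortKey($y);
}

# Compute each key once, then sort by key.
sub sort {
    my ($self, @list) = @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ $self->getSortKey($_), $_ ] } @list;
}

package main;

my $uca    = My::ToyCollator->new;
my @sorted = $uca->sort(qw(banana Apple cherry apple));
my $result = $uca->cmp("a", "B");
```

Computing the key once per string keeps the comparison itself a plain
binary cmp, which is the main attraction of the sort-key approach.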

(2) DESCRIPTION

# arguments to be passed to the method new.

%arguments = (
  alternate => 'shifted',
  rearrangement => \@charList,
  backwards => $levelNumber, # or \@levelNumbers
  entry => $element,
  preprocess => \&preprocess,
  ignoreName => qr/regex/,
  ignoreChar => qr/regex/,
  overrideCJK => \&overrideCJK,
  overrideHangul => \&overrideHangul,
  table => $filename,
  upper_before_lower => $bool,
);

  [1] alternate
   -- see 3.2.2 Alternate Weighting, UTR #10.

    alternate => 'shifted', 'blanked' or 'non-ignorable'.

    * controls how collation elements marked with an asterisk
      (the variable elements, which include many punctuation marks
      and symbols) are weighted
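
A rough sketch of the three settings, as I read UTR #10, follows. The function
name apply_alternate, the element layout, and the sample weights are
assumptions for illustration; the special handling of ignorable elements that
follow a variable element is left out.

```perl
use strict;
use warnings;

# Each collation element is [ $is_variable, $primary, $secondary, $tertiary ].
# Simplification (assumption): ignorable elements that follow a variable
# element are not given special treatment here.
sub apply_alternate {
    my ($mode, @elements) = @_;
    my @out;
    for my $ce (@elements) {
        my ($variable, $l1, $l2, $l3) = @$ce;
        if ($mode eq 'non-ignorable') {
            push @out, [ $l1, $l2, $l3 ];              # weights used as-is
        }
        elsif ($mode eq 'blanked') {
            # Variable elements are zeroed at levels 1 to 3.
            push @out, $variable ? [ 0, 0, 0 ] : [ $l1, $l2, $l3 ];
        }
        elsif ($mode eq 'shifted') {
            # Variable elements move their primary weight to a fourth level.
            push @out, $variable ? [ 0, 0, 0, $l1 ]
                                 : [ $l1, $l2, $l3, 0xFFFF ];
        }
        else {
            die "unknown alternate mode: $mode";
        }
    }
    return @out;
}

# A variable element followed by an ordinary letter
# (the weights are made up for the example).
my @ces     = ( [ 1, 0x0221, 0x0020, 0x0002 ], [ 0, 0x06D9, 0x0020, 0x0002 ] );
my @shifted = apply_alternate('shifted', @ces);
my @blanked = apply_alternate('blanked', @ces);
```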
    

  [2] rearrangement
   -- see 3.1.3 Rearrangement, UTR #10.

    Characters that are not coded in logical order and need to be rearranged.

    By default, 
    rearrangement => [ 0x0E40..0x0E44, 0x0EC0..0x0EC4],
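
As a sketch of what the rearrangement step might do, the following swaps each
listed character with the character that follows it. The function name
rearrange is only an illustrative assumption.

```perl
use strict;
use warnings;

# Default list from the proposal: Thai and Lao prefixed vowels.
my @rearrange = (0x0E40 .. 0x0E44, 0x0EC0 .. 0x0EC4);
my %rearrange = map { $_ => 1 } @rearrange;

# Swap each listed character with the character that follows it.
sub rearrange {
    my ($str) = @_;
    my @cp = map { ord($_) } split //, $str;
    for (my $i = 0; $i < $#cp; $i++) {
        if ($rearrange{ $cp[$i] }) {
            @cp[$i, $i + 1] = @cp[$i + 1, $i];
            $i++;    # skip the character just moved
        }
    }
    return join '', map { chr($_) } @cp;
}

# THAI CHARACTER SARA E followed by THAI CHARACTER KO KAI
my $logical = rearrange("\x{0E40}\x{0E01}");
```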

  [3] backwards
   -- see 3.1.2 French Accents, UTR #10.

     backwards => $levelNumber, # or \@levelNumbers

     * weights at the given level(s) are compared in reverse order
       ex. level 2 (accent ordering) in French
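
The effect on French accent ordering can be sketched with a toy key. The
weights below are made up (primary = base letter, secondary = 0x20 for no
accent and 0x21 for any accent), and sort_key is a hypothetical helper, not
part of the proposed interface.

```perl
use strict;
use warnings;

# Toy weights (assumption): primary = base letter, secondary = accent.
my %accented = ( "\x{F4}" => 'o', "\x{E9}" => 'e' );   # o-circumflex, e-acute

sub sort_key {
    my ($str, %opt) = @_;
    my (@primary, @secondary);
    for my $ch (split //, $str) {
        my $base = exists $accented{$ch} ? $accented{$ch} : $ch;
        push @primary,   ord $base;
        push @secondary, exists $accented{$ch} ? 0x21 : 0x20;
    }
    # "backwards => 2": compare secondary weights from the end of the string.
    @secondary = reverse @secondary if ($opt{backwards} || 0) == 2;
    # Levels are separated by a zero weight.
    return pack 'n*', @primary, 0, @secondary;
}

my @words = ("cot\x{E9}", "c\x{F4}t\x{E9}", "cote", "c\x{F4}te");

# Forward secondary comparison: cote, cot<e-acute>, c<o-circ>te, c<o-circ>t<e-acute>
my @forward = sort { sort_key($a) cmp sort_key($b) } @words;

# Backwards secondary comparison: cote, c<o-circ>te, cot<e-acute>, c<o-circ>t<e-acute>
my @french = sort {
    sort_key($a, backwards => 2) cmp sort_key($b, backwards => 2)
} @words;
```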

  [4] entry
   -- see 3.1 Linguistic Features;
          3.2.1 File Format, UTR #10.

 override a default order or add a new element

  entry => <<'ELEMENT', # use the UCA file format
00E6 ; [.06D9.0020.0002] [.073A.0020.0002] # ligature ae equiv. to <a e>
0063 0068 ; [.0707.0020.0002]      # "ch" in traditional Spanish
ELEMENT

  or like the following?

  entry =>
   {
     "\x{00E6}" => [
       [0x06D9, 0x0020, 0x0002],
       [0x073A, 0x0020, 0x0002],
     ],
     "ch" => [0x0707, 0x0020, 0x0002],
   },
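
The two forms can be converted into each other. Here is a minimal sketch of a
parser from the file format to the hash form; parse_entries is a hypothetical
helper name, and it always produces a list of elements per key, even when
there is only one.

```perl
use strict;
use warnings;

# Parse lines in the UCA table format into
#   { string => [ [primary, secondary, tertiary], ... ] }.
sub parse_entries {
    my ($text) = @_;
    my %entry;
    for my $line (split /\n/, $text) {
        $line =~ s/#.*//;                      # strip trailing comment
        next unless $line =~ /\S/;             # skip empty lines
        my ($chars, $weights) = split /;/, $line, 2;
        next unless defined $weights;
        # "0063 0068" becomes the string "ch".
        my $key = join '', map { chr(hex($_)) } split ' ', $chars;
        my @ces;
        while ($weights =~ /\[[.*]([0-9A-Fa-f.]+)\]/g) {
            push @ces, [ map { hex($_) } split /\./, $1 ];
        }
        $entry{$key} = \@ces;
    }
    return \%entry;
}

my $entry = parse_entries(<<'ELEMENT');
00E6 ; [.06D9.0020.0002] [.073A.0020.0002] # ligature ae equiv. to <a e>
0063 0068 ; [.0707.0020.0002]      # "ch" in traditional Spanish
ELEMENT
```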

  [5] preprocess
   -- see 5.1 Preprocessing, UTR #10.

  * a coderef applied to each string before its sort key is formed
  
   ex. preprocess => sub {
           my $str = shift;
           $str =~ s/\b(?:an?|the)\s+//g;
           $str;
        },
    # This drops English articles such as "a", "an", or "the".
    # Then, "the pen" sorts before "a pencil".
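
A quick way to check the idea without any collation table is to apply the
coderef inside an ordinary sort. Plain string comparison stands in for the
real collation step here, which is an assumption of this sketch.

```perl
use strict;
use warnings;

my $preprocess = sub {
    my $str = shift;
    $str =~ s/\b(?:an?|the)\s+//g;   # drop English articles
    $str;
};

# Compare the preprocessed forms; plain string comparison stands in
# for the real collation step (assumption).
my @sorted = sort { $preprocess->($a) cmp $preprocess->($b) }
             ("a pencil", "the pen", "an eraser");
```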

  [6] ignoreName or ignoreChar
   -- see 6.3.4 Reducing the Repertoire, UTR #10.

  ignoreName => qr/\bDINGBAT\b/,
     # Elements whose names match the regex are ignored.

  ignoreChar => qr/^(?:\p{InDingbat}|\p{Lm})$/,
     # Elements that match the regex are ignored.

When 'a' and 'e' are ignored,
'element' is equal to 'lament' (both reduce to 'lmnt').

In practice, though, it is better to ignore only characters
that are unfamiliar to you (and probably never used).
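
The 'element' / 'lament' case can be reproduced with a small sketch.
strip_ignored is a hypothetical helper that drops every character matching
the regex; a real implementation would presumably drop the corresponding
collation elements instead.

```perl
use strict;
use warnings;

my $ignoreChar = qr/^[ae]$/;   # the 'a' and 'e' case from the text

# Drop every character matching the regex before comparing.
sub strip_ignored {
    my ($str, $re) = @_;
    return join '', grep { $_ !~ $re } split //, $str;
}

my $x = strip_ignored('element', $ignoreChar);
my $y = strip_ignored('lament',  $ignoreChar);
```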

  [7] overrideCJK or overrideHangul
   -- see 7.1 Derived Collation Elements, UTR #10.

By default, CJK Unified Ideographs are mapped
in Unicode codepoint order,
and Hangul Syllables are decomposed into Hangul Jamo.

The mapping of CJK Unified Ideographs
or Hangul Syllables may be overridden.

  ex. CJK Unified Ideographs in the JIS codepoint order.
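
One possible shape for such an override, as a sketch only: the coderef takes a
codepoint and returns a collation element whose primary weight is taken from
the EUC-JP encoding. The calling convention (codepoint in, one element out) is
my assumption, and the sketch only makes sense for ideographs that are in
JIS X 0208.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical override: derive the primary weight from the EUC-JP
# encoding, so that ideographs sort in JIS X 0208 order.
# Assumption: every codepoint passed in is a JIS X 0208 ideograph.
my $overrideCJK = sub {
    my $u     = shift;                          # Unicode codepoint
    my $bytes = encode('euc-jp', chr($u));      # two bytes for JIS X 0208
    my $n     = unpack 'n', $bytes;             # two bytes as one integer
    return [ $n, 0x0020, 0x0002 ];              # one collation element
};

# U+4E00 precedes U+4E9C in Unicode, but follows it in JIS X 0208.
my @unicode_order = sort { $a <=> $b } (0x4E9C, 0x4E00);
my @jis_order     = sort {
    $overrideCJK->($a)->[0] <=> $overrideCJK->($b)->[0]
} (0x4E00, 0x4E9C);
```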

  [8] table
   -- see 3.2 Default Unicode Collation Element Table, UTR #10.

By default,
   http://www.unicode.org/unicode/reports/tr10/allkeys.txt
is used as the element table.

You can use another element table if desired.

  [9] upper_before_lower
   -- see 6.6 Case Comparisons;
          7.3.1 Tertiary Weight Table, UTR #10.

By default, lowercase is before uppercase.
If upper_before_lower is true, this is reversed.
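
A toy sketch of the switch: the key below uses the case-folded letter as the
primary weight and a tertiary weight of 0x0002 for lowercase and 0x0008 for
uppercase, and the option simply exchanges the two. sort_key and the exact
weights are assumptions for illustration.

```perl
use strict;
use warnings;

# Toy key (assumption): primary = case-folded letter,
# tertiary = 0x0002 for lowercase, 0x0008 for uppercase.
sub sort_key {
    my ($str, %opt) = @_;
    my (@primary, @tertiary);
    for my $ch (split //, $str) {
        my $is_upper = $ch ne lc($ch);
        push @primary, ord(lc($ch));
        # With upper_before_lower, uppercase gets the smaller weight.
        push @tertiary,
            ($is_upper xor $opt{upper_before_lower}) ? 0x0008 : 0x0002;
    }
    # Levels are separated by a zero weight.
    return pack 'n*', @primary, 0, @tertiary;
}

my @words = qw(Abc abc ABC);

my @lower_first = sort { sort_key($a) cmp sort_key($b) } @words;
my @upper_first = sort {
    sort_key($a, upper_before_lower => 1) cmp
    sort_key($b, upper_before_lower => 1)
} @words;
```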

regards, SADAHIRO Tomoyuki
