perl-unicode

New API available to access Unicode DB, and RFC on changes to it.

2011-11-21 14:42:47
Perl 5.15.5, now available, adds functions to Unicode::UCD that allow unfettered programmatic access to the Unicode character database. The API is quite similar to what was sent out for comment on this list several months ago; several changes were required as a result of lessons learned during implementation. Attached to this email is an HTML file showing the additions to the pod since 5.14, highlighted with a yellow background.

As a result of this API, reading the files in lib/unicore directly is now deprecated. Those files may change, whereas the API will be stable as of 5.16. In the meantime, I'd be happy to have people use it and give me feedback on any problems with the API or bugs in the code.

That said, I already wish to change the API for certain of the outputs of prop_invmap() in order to make them more compact. For example, take the uc() property. What it currently returns is this (taken from the attached pod):

 @$uppers_ranges_ref    @$uppers_maps_ref   Note
       0                 "<code point>"
      97                     65          'a' maps to 'A'
      98                     66          'b' => 'B'
      99                     67          'c' => 'C'
      ...
     120                     88          'x' => 'X'
     121                     89          'y' => 'Y'
     122                     90          'z' => 'Z'
     123                "<code point>"
     181                    924          MICRO SIGN => Greek Cap MU
     182                "<code point>"
     ...
    0x0149              [ 0x02BC 0x004E ]
    0x014A              "<code point>"
    0x014B                 0x014A
     ...
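
To make the current format concrete, here is a minimal sketch of a lookup over it. The two arrays are a hand-copied excerpt of the table above, not the output of a real prop_invmap() call, and find_range() and upper_full() are hypothetical helper names. Entries elided by "..." in the table are simply absent, so results are only meaningful for the code points actually shown.

```perl
use strict;
use warnings;

# Hand-copied excerpt of the table above (an assumption for illustration).
# Entry $i covers the code points from $uppers_ranges[$i] up to
# $uppers_ranges[$i + 1] - 1; the last entry is open-ended only because
# the excerpt stops there.
my @uppers_ranges = (0, 97 .. 122, 123, 181, 182, 0x0149, 0x014A, 0x014B);
my @uppers_maps   = ('<code point>', 65 .. 90, '<code point>', 924,
                     '<code point>', [0x02BC, 0x004E], '<code point>', 0x014A);

# Binary search for the last index whose range start is <= $cp.
sub find_range {
    my ($ranges_ref, $cp) = @_;
    my ($lo, $hi) = (0, $#$ranges_ref);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);
        if ($ranges_ref->[$mid] <= $cp) { $lo = $mid }
        else                            { $hi = $mid - 1 }
    }
    return $lo;
}

# Returns the list of code points that $cp maps to.
sub upper_full {
    my ($cp) = @_;
    my $map = $uppers_maps[ find_range(\@uppers_ranges, $cp) ];
    return @$map if ref $map;                # multi-character mapping
    return $cp   if $map eq '<code point>';  # maps to itself
    return $map;                             # single code point
}

printf "U+%04X => %s\n", $_, join ' ', map { sprintf 'U+%04X', $_ } upper_full($_)
    for 97, 123, 181, 0x0149;
```

Note that every mapped entry here covers exactly one code point, which is why 'a' through 'z' need 26 separate entries.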


That could be more compactly represented as:
 @$uppers_ranges_ref    @$uppers_maps_ref   Note
       0                      0
      97                    -32          'a'-'z' map to 'A'-'Z'
     123                      0
     181                    743          MICRO SIGN => Greek Cap MU
     182                      0
     ...
    0x0149              [ 0x02BC 0x004E ]
    0x014A                    0
    0x014B                   -1
     ...

where the map is to be added to the code point to get the final result (for example, 97 + -32 = 65 and 181 + 743 = 924). Thus only one entry is needed to represent all 26 ASCII lowercase character mappings, instead of 26 entries. This makes such tables significantly smaller. The Perl core currently does a linear search through them looking for mappings, so using the more compact versions would speed that up significantly. The size reduction is 30-40%, and for the decimal digit mapping the result is a full order of magnitude smaller, making the search much faster.
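
The conversion from the current form to the delta form can be sketched as follows. This is only an illustration of the idea, not the code in the core: to_deltas() is a hypothetical name, the input is the same hand-copied excerpt of the first table, and it assumes every mapped entry in the input covers exactly one code point, as in that excerpt.

```perl
use strict;
use warnings;

# Hand-copied excerpt of the current-format table (an assumption for
# illustration, not real prop_invmap() output).
my @full_ranges = (0, 97 .. 122, 123, 181, 182, 0x0149, 0x014A, 0x014B);
my @full_maps   = ('<code point>', 65 .. 90, '<code point>', 924,
                   '<code point>', [0x02BC, 0x004E], '<code point>', 0x014A);

# Turn each scalar map into (map - code point), then fold adjacent
# entries that share the same delta into one range. List entries are
# passed through untouched and never merged.
sub to_deltas {
    my ($ranges_ref, $maps_ref) = @_;
    my (@out_ranges, @out_maps);
    for my $i (0 .. $#$ranges_ref) {
        my $map = $maps_ref->[$i];
        my $new = ref $map               ? $map
                : $map eq '<code point>' ? 0
                :                          $map - $ranges_ref->[$i];
        # Same delta as the previous range: it simply extends that range.
        next if @out_maps
             && !ref $new
             && !ref $out_maps[-1]
             && $out_maps[-1] == $new;
        push @out_ranges, $ranges_ref->[$i];
        push @out_maps,   $new;
    }
    return (\@out_ranges, \@out_maps);
}

my ($delta_ranges, $delta_maps) = to_deltas(\@full_ranges, \@full_maps);
printf "%d entries before, %d after\n",
    scalar @full_ranges, scalar @$delta_ranges;
```

On this small excerpt the 33 entries collapse to the 8 shown in the compact table; the 30-40% figure quoted above is for the complete tables.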

Returning the delta only makes sense for a few tables: those whose maps are code points, or the decimal digits.

As you can see in the example for 0x0149, I wouldn't propose to make deltas of the lists, even though that is inconsistent. Lists generally require special handling anyway.
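
For what it's worth, here is roughly what a consumer of the proposed format would look like, including the special case for lists. Again this is a sketch under assumptions: the arrays are copied by hand from the compact table above, upper_from_delta() is a hypothetical name, and the linear scan merely mimics the kind of search the core does today.

```perl
use strict;
use warnings;

# Hand-copied from the compact table above (an assumption for
# illustration). The last entry is open-ended only because the excerpt
# stops there.
my @delta_ranges = (0, 97, 123, 181, 182, 0x0149, 0x014A, 0x014B);
my @delta_maps   = (0, -32, 0, 743, 0, [0x02BC, 0x004E], 0, -1);

sub upper_from_delta {
    my ($cp) = @_;

    # Linear scan for the last range that starts at or below $cp.
    my $i = 0;
    $i++ while $i < $#delta_ranges && $delta_ranges[$i + 1] <= $cp;

    my $map = $delta_maps[$i];
    return @$map if ref $map;   # a list is returned as-is, no delta applied
    return $cp + $map;          # otherwise add the delta to the code point
}

printf "U+%04X => %s\n", $_,
    join ' ', map { sprintf 'U+%04X', $_ } upper_from_delta($_)
    for 97, 122, 181, 0x0149, 0x014B;
```

The only wrinkle for callers is the ref check: a plain number is always a delta, and an array ref is always a literal list of code points.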