New API available to access Unicode DB, and RFC on changes to it.

Perl 5.15.5, now available, has additions to Unicode::UCD that allow unfettered programmatic access to the Unicode character database. The API is quite similar to what was sent out for comment on this list several months ago; several changes were required as a result of lessons learned during implementation. This email has an attachment, an HTML file that shows (with a yellow background) the additions to the pod since 5.14.

As a result of this API, reading the files in lib/unicore directly is deprecated. Those files may change, whereas the API will be stable as of 5.16. In the meantime, I'd be happy to have people use this and give me feedback on any problems with the API or bugs in the code.

I do already wish to change the API for certain of the outputs of prop_invmap(), in order to make them more compact. For example, take the uc() property. What it currently returns is this (taken from the attached pod):


 @$uppers_ranges_ref    @$uppers_maps_ref   Note
       0                 "<code point>"
      97                     65          'a' maps to 'A'
      98                     66          'b' => 'B'
      99                     67          'c' => 'C'
      ...
     120                     88          'x' => 'X'
     121                     89          'y' => 'Y'
     122                     90          'z' => 'Z'
     123                "<code point>"
     181                    924          MICRO SIGN => Greek Cap MU
     182                "<code point>"
     ...
    0x0149              [ 0x02BC 0x004E ]
    0x014A              "<code point>"
    0x014B                 0x014A
     ...


That could be more compactly represented as:

 @$uppers_ranges_ref    @$uppers_maps_ref   Note
       0                      0
      97                    -32          'a'-'z' map to 'A'-'Z'
     123                      0
     181                    743          MICRO SIGN => Greek Cap MU
     182                      0
     ...
    0x0149              [ 0x02BC 0x004E ]
    0x014A                    0
    0x014B                   -1
     ...

where the map is to be added to the code point to get the final result. Thus only one entry, instead of 26, is needed to represent all 26 ASCII lowercase character mappings. This makes such tables significantly smaller. The Perl core currently does a linear search through them looking for mappings, so using the more compact versions would speed that up significantly. The reduction is 30-40%, and for the decimal digit mapping the result is a full order of magnitude smaller, making the search much faster.
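
To make the "add the map to the code point" rule concrete, here is a minimal sketch of a lookup against the compact table. It is written in Python only for illustration and is not how the Perl core does it: the names RANGES, DELTAS, COVERED_END and upper_code_point are hypothetical, the data is just the first five rows of the table above, and it uses a binary search instead of the core's linear search.

```python
from bisect import bisect_right

# First five rows of the proposed compact uc() table from this email.
# Later rows are elided in the email, so this excerpt only covers
# code points below COVERED_END (an assumption of this sketch).
RANGES = [0, 97, 123, 181, 182]
DELTAS = [0, -32, 0, 743, 0]
COVERED_END = 183

def upper_code_point(cp):
    """Return the simple uppercase mapping of cp using the delta table."""
    if not 0 <= cp < COVERED_END:
        raise ValueError("code point outside the excerpted table")
    # Index of the last range whose start is <= cp.
    i = bisect_right(RANGES, cp) - 1
    # The stored value is a delta: add it to the code point itself.
    return cp + DELTAS[i]

if __name__ == "__main__":
    print(chr(upper_code_point(ord("q"))))   # Q
    print(upper_code_point(181))             # 924
```

Every code point from 97 through 122 lands on the single -32 entry, which is where the saving over 26 separate entries comes from.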

Returning deltas only makes sense for a few tables: those whose maps are code points, or the decimal digits.

As you can see in the example for 0x0149, I wouldn't propose to make deltas of the lists, even though that is inconsistent. They generally require special handling.