perl-unicode

RFC: API to access Unicode db files

2011-07-21 10:04:49
Some applications find it necessary to read the Unicode files that mktables generates. For example, grepping through CPAN shows that Text::Unicode::Equivalents reads Decomposition.pl. That file, like most of the other generated files, is marked for internal use only, because we wish to reserve the right to change them. But applications currently have no feasible alternative. Prior to 5.14, we delivered the full Unicode db files that the Unicode Consortium publishes, whose format is guaranteed not to change. We dropped those files in 5.14 to save disk space.

I'm proposing a new function Unicode::UCD::prop_invmap() to return the contents of those files in a Unicode-centric way, so that applications can use it and we can deprecate non-core use of our generated files.

The function returns an inversion map, a data structure more commonly used in the Unicode world than in the Perl world. It consists of two parallel arrays. I suppose a more Perl-centric data structure would be an array of hashes, but the inversion map seems simpler to me to manipulate.

(This function would be in addition to the previously rfc'd function Unicode::UCD::prop_invlist() which would return a list of all code points that match a property-value.)

=pod

=head2 prop_invmap

C<prop_invmap> is used to get the complete mapping definition for the input
property, in the form of an inversion map.  An inversion map consists of two
parallel arrays.  One is an ordered list of code points that mark range
beginnings, and the other gives the value that all code points in the
corresponding range have.  C<prop_invmap> is called with the name of the
desired property, and references to the two arrays, which it fills.  For
example,

 prop_invmap("Numeric_Value", \@numerics_ranges, \@numerics_maps);

will populate the arrays as shown below:

 @numerics_ranges  @numerics_maps        Note
        0x00             "NaN"          NaN stands for "Not a Number"
        0x30             0              DIGIT 0
        0x31             1
        0x32             2
        ...
        0x37             7
        0x38             8
        0x39             9              DIGIT 9
        0x3A             "NaN"
        0xB2             2              SUPERSCRIPT 2
        0xB3             3              SUPERSCRIPT 3
        0xB4             "NaN"
        0xB9             1              SUPERSCRIPT 1
        0xBA             "NaN"
        0xBC             0.25           VULGAR FRACTION 1/4
        0xBD             0.5            VULGAR FRACTION 1/2
        0xBE             0.75           VULGAR FRACTION 3/4
        0xBF             "NaN"
        0x660            0              ARABIC-INDIC DIGIT ZERO
        ...              ...
     0x110000            undef

The second line means that the value for the code point 0x30 (which is
"DIGIT 0") is 0.  The first line means that all code points in the range from
0x00 to 0x2F (which is 0x30 (from the second line) - 1) have the value "NaN".
The final line means that all code points above the legal Unicode maximum
code point have the value C<undef> (the undefined value, not the string
"undef").

The arrays completely specify the mappings for all possible code points.
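
To make the range convention concrete, here is a minimal sketch that turns
the two parallel arrays into explicit first/last/value triples.  The data is
an excerpt of the table above typed in by hand, not the output of a real
C<prop_invmap> call, and C<explicit_ranges> is a hypothetical helper name
used only for illustration.

```perl
use strict;
use warnings;

# Excerpt of the Numeric_Value table above (0x00 through 0xBF), typed in by
# hand.  It is sample data, not the output of a real prop_invmap() call.
my @numerics_ranges = (0x00, 0x30 .. 0x39, 0x3A, 0xB2, 0xB3, 0xB4,
                       0xB9, 0xBA, 0xBC, 0xBD, 0xBE, 0xBF);
my @numerics_maps   = ("NaN", 0 .. 9, "NaN", 2, 3, "NaN",
                       1, "NaN", 0.25, 0.5, 0.75, "NaN");

# Hypothetical helper: build [first, last, value] triples.  Range $i ends
# one below the start of range $i+1, so the final entry has no known end
# within this excerpt and is skipped.
sub explicit_ranges {
    my ($ranges, $maps) = @_;
    my @out;
    for my $i (0 .. $#{$ranges} - 1) {
        push @out, [ $ranges->[$i], $ranges->[$i + 1] - 1, $maps->[$i] ];
    }
    return @out;
}

for my $r (explicit_ranges(\@numerics_ranges, \@numerics_maps)) {
    printf "U+%04X..U+%04X => %s\n", @$r;
}
```

In a complete map the last entry would be 0x110000 with the value C<undef>,
so every legal code point falls inside some explicit range.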

The special string S<C<"E<lt>code pointE<gt>">> is used to specify that
the value of a code point is itself.  For example, the beginnings of the
arrays for

 prop_invmap("Uppercase_Mapping", \@uppers_ranges, \@uppers_maps);

look like this:

 @uppers_ranges    @uppers_maps       Note
       0          "<code point>"
      97              65          'a' maps to 'A'
      98              66          'b' => 'B'
      99              67          'c' => 'C'
      ...
     120              88          'x' => 'X'
     121              89          'y' => 'Y'
     122              90          'z' => 'Z'
     123         "<code point>"
     181             924          MICRO SIGN => Greek Cap MU
     182         "<code point>"
     223           [ 83 83 ]      SHARP S => 'SS'
     224             192

The first line means that the uppercase of code point 0 is 0, of 1 is 1, ...
of 96 is 96.  Without the C<"E<lt>code pointE<gt>"> notation, every code point
would need its own entry, and each array would have more than a million
entries just to cover the legal Unicode code points!

In some properties some code points map to a sequence of multiple code points.
For those, the corresponding entries in the map array are not scalars, but
references to anonymous arrays containing the ordered list of code points
mapped to, as shown in the example above for 223.
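
A caller therefore has three kinds of map entry to handle: the special
string, a plain code point, and an array reference.  The sketch below shows
one way to do that.  The data is an excerpt of the table above typed in by
hand (so it is only meaningful for code points 0 through 224), and
C<upper_of> is a hypothetical helper, not part of the proposed API.

```perl
use strict;
use warnings;

# Excerpt of the Uppercase_Mapping table above, typed in by hand (sample
# data, not real prop_invmap() output).  Only meaningful for 0..224.
my @uppers_ranges = (0, 97 .. 122, 123, 181, 182, 223, 224);
my @uppers_maps   = ("<code point>", 65 .. 90, "<code point>", 924,
                     "<code point>", [ 83, 83 ], 192);

# Hypothetical helper: return the uppercase of $cp as a list of code points.
sub upper_of {
    my ($cp) = @_;

    # Linear scan for the last range start that is <= $cp.  This is kept
    # simple on purpose; the array is sorted, so a binary search also works.
    my $i = 0;
    $i++ while $i < $#uppers_ranges && $uppers_ranges[$i + 1] <= $cp;

    my $map = $uppers_maps[$i];
    return @$map if ref($map) eq 'ARRAY';      # sequence of code points
    return $cp   if $map eq '<code point>';    # code point maps to itself
    return $map;                               # single code point
}

printf "%d => %s\n", $_, join(' ', upper_of($_)) for 65, 97, 181, 223;
```

Returning a list in every case means the caller does not need to care
whether the mapping was a single code point or a sequence.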

The "Name" property map includes entries such as

 CJK UNIFIED IDEOGRAPH-<code point>

This means that the name for the code point is "CJK UNIFIED IDEOGRAPH-"
with the code point (expressed in hexadecimal) appended to it.  Also, the
notation "E<lt>hangul syllableE<gt>" occurs in this property, meaning that the
name is algorithmically calculated.  These names can be generated via the
function C<charnames::viacode()>.
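
As an illustration of the first notation, the following sketch substitutes
the hexadecimal code point into such an entry and compares the result with
what C<charnames::viacode()> reports.  C<expand_name> is a hypothetical
helper, and the use of at least four uppercase hex digits is an assumption
based on the usual U+XXXX convention.

```perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical helper: expand a Name map entry for code point $cp by
# replacing the "<code point>" placeholder with $cp in hexadecimal.
sub expand_name {
    my ($entry, $cp) = @_;
    (my $name = $entry) =~ s/<code point>/sprintf "%04X", $cp/e;
    return $name;
}

my $cp       = 0x4E00;    # a code point in the CJK Unified Ideographs block
my $expanded = expand_name('CJK UNIFIED IDEOGRAPH-<code point>', $cp);

print "expanded: $expanded\n";
print "viacode:  ", charnames::viacode($cp), "\n";
```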

The "Decomposition_Mapping" property also uses "E<lt>hangul syllableE<gt>" for
those code points whose decomposition is algorithmically calculated.  These
can be generated via the function C<Unicode::Normalize::NFD()>.  This property
contains many occurrences of code points whose mappings are ordered lists of
other code points.

The return value is
C<undef> if the property is unknown;
C<s> if all the elements of the map array are simple scalars;
C<n> for the Name property, which has the complications described above;
C<d> for the Decomposition_Mapping property (complications already described);
otherwise C<c> if some of the map array elements are
S<C<"E<lt>code pointE<gt>">>;
and C<l> if additionally some are lists of code points.

Because the array of range beginnings is sorted, a binary search can be used
to quickly find the range that contains a given code point, and hence its
corresponding mapping.
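
A minimal sketch of such a lookup follows.  The data is an excerpt of the
Numeric_Value table above typed in by hand (it stops at 0x660, so results are
only meaningful up to there), and C<invmap_index> is a hypothetical helper
name, not part of the proposed API.

```perl
use strict;
use warnings;

# Excerpt of the Numeric_Value table above, typed in by hand (sample data,
# not real prop_invmap() output).  Only meaningful up to 0x660.
my @numerics_ranges = (0x00, 0x30 .. 0x39, 0x3A, 0xB2, 0xB3, 0xB4,
                       0xB9, 0xBA, 0xBC, 0xBD, 0xBE, 0xBF, 0x660);
my @numerics_maps   = ("NaN", 0 .. 9, "NaN", 2, 3, "NaN",
                       1, "NaN", 0.25, 0.5, 0.75, "NaN", 0);

# Hypothetical helper: index of the range containing $cp, that is, the
# largest $i with $ranges->[$i] <= $cp.  Returns nothing if $cp precedes
# the first range.
sub invmap_index {
    my ($ranges, $cp) = @_;
    return if $cp < $ranges->[0];

    my ($lo, $hi) = (0, $#{$ranges});
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);    # round up so the loop ends
        if ($ranges->[$mid] <= $cp) { $lo = $mid }
        else                        { $hi = $mid - 1 }
    }
    return $lo;
}

for my $cp (0x2F, 0x35, 0xB3, 0xBD) {
    my $i = invmap_index(\@numerics_ranges, $cp);
    printf "U+%04X => %s\n", $cp, $numerics_maps[$i];
}
```

With a complete map, a caller would also need to check for the C<undef>
value that marks code points above the Unicode maximum before using the
result.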

=cut
