Some applications are finding it necessary to read in the Unicode files
that mktables generates. For example, grepping through CPAN indicates
that Text::Unicode::Equivalents reads Decomposition.pl. That file, like
most of the other generated files, is marked for internal use only, because
we wish to reserve the right to change them. But applications currently
have no feasible alternative. Prior to 5.14, we delivered the full Unicode
database files that the Unicode Consortium publishes, and whose format is
guaranteed not to change. But we dropped those files in 5.14 to save disk
space.

I'm proposing a new function, Unicode::UCD::prop_invmap(), to return the
contents of those files in a Unicode-centric way, so that applications
can use it and we can deprecate non-core use of our generated files.

The function returns an inversion map, which is a data structure more
used in the Unicode world than the Perl world. It consists of two
parallel arrays. I suppose a more Perl-centric data structure would be
an array of hashes, but the inversion map seems simpler to me to manipulate.

(This function would be in addition to the previously RFC'd function
Unicode::UCD::prop_invlist(), which would return a list of all code
points that match a property-value.)
=pod

=head2 prop_invmap

C<prop_invmap> is used to get the complete mapping definition for the input
property, in the form of an inversion map. An inversion map consists of two
parallel arrays. One is an ordered list of code points that mark range
beginnings, and the other gives the value that all code points in the
corresponding range have. C<prop_invmap> is called with the name of the
desired property, and references to the two arrays, which it fills. For
example,

    prop_invmap("Numeric_Value", \@numerics_ranges, \@numerics_maps);

will populate the arrays as shown below:

 @numerics_ranges  @numerics_maps  Note
        0x00          "NaN"        NaN stands for "Not a Number"
        0x30            0          DIGIT 0
        0x31            1
        0x32            2
        ...
        0x37            7
        0x38            8
        0x39            9          DIGIT 9
        0x3A          "NaN"
        0xB2            2          SUPERSCRIPT 2
        0xB3            3          SUPERSCRIPT 3
        0xB4          "NaN"
        0xB9            1          SUPERSCRIPT 1
        0xBA          "NaN"
        0xBC          0.25         VULGAR FRACTION 1/4
        0xBD          0.5          VULGAR FRACTION 1/2
        0xBE          0.75         VULGAR FRACTION 3/4
        0xBF          "NaN"
       0x660            0          ARABIC-INDIC DIGIT ZERO
        ...           ...
    0x110000          undef

The second line means that the value for the code point 0x30 (which is
"DIGIT 0") is 0. The first line means that all code points in the range
from 0x00 to 0x2F (which is 0x30, from the second line, minus 1) have the
value "NaN". The final line means that all code points above the legal
Unicode maximum code point have the value C<undef> (not the string
"u-n-d-e-f").

The arrays completely specify the mappings for all possible code points.
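
The rule that each range ends one below the next range's start can be made concrete with a short sketch. It does not call C<prop_invmap> (which is only proposed here); instead it assumes two hand-copied arrays holding the excerpt of the Numeric_Value table from 0x3A through 0x660. The variable names are illustrative only.

```perl
use strict;
use warnings;

# Hand-copied excerpt of the Numeric_Value table (0x3A .. 0x660).
# A real caller would have these arrays filled in by prop_invmap().
my @ranges = (0x3A, 0xB2, 0xB3, 0xB4, 0xB9, 0xBA, 0xBC, 0xBD, 0xBE, 0xBF, 0x660);
my @maps   = ("NaN", 2, 3, "NaN", 1, "NaN", 0.25, 0.5, 0.75, "NaN", 0);

# Each range runs from its own start to one less than the next start.
# The last excerpt entry has no successor here, so it is not expanded.
my @expanded;
for my $i (0 .. $#ranges - 1) {
    push @expanded, [ $ranges[$i], $ranges[$i + 1] - 1, $maps[$i] ];
}

printf "U+%04X..U+%04X => %s\n", @$_ for @expanded;
```

With the full arrays, the same loop would cover every code point, because the final entry (0x110000) acts as the terminator.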
The special string S<C<"E<lt>code pointE<gt>">> is used to specify that
the value of a code point is itself. For example, the beginnings of the
arrays for

    prop_invmap("Uppercase_Mapping", \@uppers_ranges, \@uppers_maps);

look like this:

 @uppers_ranges  @uppers_maps     Note
        0        "<code point>"
       97             65          'a' maps to 'A'
       98             66          'b' => 'B'
       99             67          'c' => 'C'
      ...
      120             88          'x' => 'X'
      121             89          'y' => 'Y'
      122             90          'z' => 'Z'
      123        "<code point>"
      181            924          MICRO SIGN => Greek Cap MU
      182        "<code point>"
      223          [ 83 83 ]      SHARP S => 'SS'
      224            192

The first line means that the uppercase of code point 0 is 0, of 1 is 1, ...
of 96 is 96. Without the S<C<"E<lt>code pointE<gt>">> notation, every code
point would need its own entry, so each array would have more than a million
entries just to list the legal Unicode code points!

In some properties some code points map to a sequence of multiple code
points. For those, the corresponding entries in the map array are not
scalars, but references to anonymous arrays containing the ordered list of
code points mapped to, as shown in the example above for 223.
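
A caller therefore has three kinds of map entry to handle: a plain code point, the self-mapping marker, and an array reference. The sketch below shows one way to do that. It assumes a hand-copied excerpt of the Uppercase_Mapping table (entries 120 through 224) rather than a call to the proposed function, and C<upper_of> is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hand-copied excerpt of the Uppercase_Mapping table above (120 .. 224).
# The entry for 224 is used only as the upper boundary of the excerpt.
my @ranges = (120, 121, 122, 123, 181, 182, 223, 224);
my @maps   = (88, 89, 90, "<code point>", 924, "<code point>", [ 83, 83 ], 192);

# Hypothetical helper: return the list of code points that $cp maps to.
sub upper_of {
    my ($cp) = @_;
    die "code point outside the excerpt\n"
        if $cp < $ranges[0] || $cp >= $ranges[-1];

    # Linear scan for the last range start that is <= $cp.
    my $i = 0;
    $i++ while $i < $#ranges && $ranges[$i + 1] <= $cp;

    my $value = $maps[$i];
    return @$value if ref $value eq 'ARRAY';      # multi-code-point mapping
    return $cp     if $value eq '<code point>';   # maps to itself
    return $value;                                # single code point
}

print join(" ", upper_of($_)), "\n" for 122, 150, 181, 223;
```

Returning a list in every case keeps the caller from having to test for references itself.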

The "Name" property map includes entries such as

    CJK UNIFIED IDEOGRAPH-<code point>

This means that the name for the code point is "CJK UNIFIED IDEOGRAPH-"
with the code point (expressed in hexadecimal) appended to it. Also, the
notation "E<lt>hangul syllableE<gt>" occurs in this property, meaning that
the name is algorithmically calculated. These names can be generated via the
function C<charnames::viacode()>.
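
As a rough illustration of how a caller might turn such a Name entry into an actual name, consider the sketch below. C<expand_name> is a hypothetical helper, and the C<%04X> format (uppercase hexadecimal, at least four digits) is an assumption based on the usual Unicode convention; the text above only says "hexadecimal".

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: turn a Name map entry into the name for $cp.
sub expand_name {
    my ($entry, $cp) = @_;

    # Algorithmic Hangul names are delegated to charnames::viacode().
    return charnames::viacode($cp) if $entry eq '<hangul syllable>';

    # Replace the placeholder with the code point in hex (assumed %04X).
    (my $name = $entry) =~ s/<code point>/sprintf("%04X", $cp)/e;
    return $name;
}

print expand_name("CJK UNIFIED IDEOGRAPH-<code point>", 0x4E00), "\n";
```

Entries without a placeholder pass through unchanged, so the helper can be applied to every element of the map array.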

The "Decomposition_Mapping" property also uses "E<lt>hangul syllableE<gt>"
for those code points whose decomposition is algorithmically calculated.
These can be generated via the function C<Unicode::Normalize::NFD()>. In
this property, many code points map to ordered lists of other code points.

The return value is

=over 4

=item C<undef>

if the property is unknown;

=item C<s>

if all the elements of the map array are simple scalars;

=item C<n>

for the Name property, which has the complications described above;

=item C<d>

for the Decomposition_Mapping property (complications already described);

=item C<c>

otherwise, if some of the map array elements are
S<C<"E<lt>code pointE<gt>">>;

=item C<l>

if additionally some are lists of code points.

=back

A binary search can be used to quickly find a code point in the inversion
list, and hence its corresponding mapping.
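
A minimal sketch of such a search follows. It again assumes a hand-copied excerpt of the Numeric_Value arrays (0x3A through 0x660) instead of calling the proposed function, and C<search_invlist> is a hypothetical name. The search finds the largest index whose range start is less than or equal to the code point.

```perl
use strict;
use warnings;

# Hand-copied excerpt of the Numeric_Value table (0x3A .. 0x660).
my @ranges = (0x3A, 0xB2, 0xB3, 0xB4, 0xB9, 0xBA, 0xBC, 0xBD, 0xBE, 0xBF, 0x660);
my @maps   = ("NaN", 2, 3, "NaN", 1, "NaN", 0.25, 0.5, 0.75, "NaN", 0);

# Hypothetical helper: index of the range containing $cp, i.e. the largest
# $i with $ranges->[$i] <= $cp, or undef if $cp precedes the first range.
sub search_invlist {
    my ($ranges, $cp) = @_;
    return undef if !@$ranges || $cp < $ranges->[0];

    my ($lo, $hi) = (0, $#$ranges);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);   # round up so the loop terminates
        if ($ranges->[$mid] <= $cp) { $lo = $mid }
        else                        { $hi = $mid - 1 }
    }
    return $lo;
}

for my $cp (0x41, 0xB9, 0xBD) {
    my $i = search_invlist(\@ranges, $cp);
    printf "U+%04X => %s\n", $cp, $maps[$i];
}
```

Because the two arrays are parallel, the index found in the ranges array is used directly to fetch the value from the map array.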
=cut