perl-unicode

Re: Need: list of Unicode characters that have canonical decompositions.

2011-07-06 15:43:23
On 07/01/2011 11:49 AM, Karl Williamson wrote:
On 07/01/2011 10:40 AM, BobH wrote:
Karl Williamson wrote:


I'm trying to think of a good name. Best so far is
UCD::get_prop_invlist()


Hm, "get" normally isn't needed.

How about something simpler such as UCD::charlist()

Bob


I think not having prop in the name is potentially misleading, and it
actually isn't a list of the chars. It's an inversion list that is
readily convertible into such a list.


I've mostly written and tested it. Here is my proposed API, to see how people like it (or not). I'm still open to a better name, but I do think the name needs to meet the requirements I mentioned above:


=pod

=head2 prop_invlist

C<prop_invlist> returns an inversion list (see below) that defines all the
code points for the Unicode property given by the input parameter string:

 say join ", ", prop_invlist("Any");
 0, 1114112

An empty list is returned if the given property is unknown.

L<perluniprops|perluniprops/Properties accessible through \p{} and \P{}> gives the list of properties that this function accepts, as well as all the possible
forms for them.  Note that many properties can be specified in a compound
form, such as

 say join ", ", prop_invlist("Script=Shavian");
 66640, 66688

 say join ", ", prop_invlist("ASCII_Hex_Digit=No");
 0, 48, 58, 65, 71, 97, 103

 say join ", ", prop_invlist("ASCII_Hex_Digit=Yes");
 48, 58, 65, 71, 97, 103

Inversion lists are a compact way of specifying Unicode properties.  The 0th
item in the list is the lowest code point that has the property-value.  The
next item is the lowest code point after that one that does NOT have the
property-value.  And the next item after that is the lowest code point after
that one that has the property-value, and so on.  Put another way, each
element in the list gives the beginning of a range that has the property-value
(for even numbered elements), or doesn't have the property-value (for odd
numbered elements).

In the final example above, the first ASCII hex digit is code point 48, the
character "0", and all code points from it through 57 (a "9") are ASCII hex
digits. Code points 58 through 64 aren't, but 65 (an "A") through 70 (an "F")
are, as are 97 ("a") through 102 ("f").  103 starts a range of code points
that aren't ASCII hex digits.  That range extends to infinity, which on your
computer can be found in the variable C<$Unicode::UCD::MAX_CP>.

It is a simple matter to expand out an inversion list to a full list of all
code points that have the property-value:

 my @invlist = prop_invlist("My Property");
 die "empty" unless @invlist;
 my @full_list;
 # Even-indexed elements start a range that has the property-value;
 # the following odd-indexed element, if any, is one past its end.
 for (my $i = 0; $i < @invlist; $i += 2) {
    my $upper = ($i + 1) < @invlist
                ? $invlist[$i+1] - 1      # In range
                : $Unicode::UCD::MAX_CP;  # To infinity.  You may want
                                          # to stop earlier
    for my $j ($invlist[$i] .. $upper) {
        push @full_list, $j;
    }
 }

=cut
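
Expanding the list is not the only way to use it. As a rough sketch (not part of the proposed API), a single code point can be tested against an inversion list directly with a binary search. The helper name in_invlist below is hypothetical, and the list is hard-coded from the ASCII_Hex_Digit=Yes example above so that the snippet runs without prop_invlist:

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: returns true if $cp is in
# the set described by the inversion list referenced by $invlist.
sub in_invlist {
    my ($invlist, $cp) = @_;
    my ($lo, $hi) = (0, scalar @$invlist);

    # Binary search for the number of list elements that are <= $cp.
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($invlist->[$mid] <= $cp) {
            $lo = $mid + 1;
        }
        else {
            $hi = $mid;
        }
    }

    # An odd count means the nearest element at or below $cp has an even
    # index, i.e. it starts a range that has the property-value.
    return $lo % 2 == 1;
}

# Inversion list copied from the ASCII_Hex_Digit=Yes example.
my @hex = (48, 58, 65, 71, 97, 103);

for my $char ("A", "G", "9", ":") {
    my $is_hex = in_invlist(\@hex, ord($char));
    print "$char: ", ($is_hex ? "yes" : "no"), "\n";
}
```

The lookup takes time logarithmic in the number of ranges rather than in the number of code points, which is one reason to return the inversion list instead of the fully expanded list.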
