On 07/01/2011 11:49 AM, Karl Williamson wrote:
On 07/01/2011 10:40 AM, BobH wrote:
Karl Williamson wrote:
I'm trying to think of a good name. Best so far is
UCD::get_prop_invlist()
Hm, "get" normally isn't needed.
How about something simpler such as UCD::charlist()
Bob
I think not having prop in the name is potentially misleading, and it
actually isn't a list of the chars. It's an inversion list that is
readily convertible into such a list.
I've mostly written and tested it. But here is my proposed API to see
how people like it (or not). (I'm still open to a better name, but I do
think that the name needs to meet the requirements I mentioned above.)
=pod
=head2 prop_invlist
C<prop_invlist> returns an inversion list (see below) that defines all the
code points for the Unicode property given by the input parameter string:
 say join ", ", prop_invlist("Any");
 0, 1114112
An empty list is returned if the given property is unknown.
L<perluniprops|perluniprops/Properties accessible through \p{} and \P{}>
gives the list of properties that this function accepts, as well as all the
possible forms for them. Note that many properties can be specified in a
compound form, such as:
 say join ", ", prop_invlist("Script=Shavian");
 66640, 66688
 say join ", ", prop_invlist("ASCII_Hex_Digit=No");
 0, 48, 58, 65, 71, 97, 103
 say join ", ", prop_invlist("ASCII_Hex_Digit=Yes");
 48, 58, 65, 71, 97, 103
Inversion lists are a compact way of specifying Unicode properties. The 0th
item in the list is the lowest code point that has the property-value. The
next item is the lowest code point after that one that does NOT have the
property-value. And the next item after that is the lowest code point after
that one that has the property-value, and so on. Put another way, each
element in the list gives the beginning of a range that has the
property-value (for even numbered elements), or doesn't have the
property-value (for odd numbered elements).
In the final example above, the first ASCII hex digit is code point 48, the
character "0", and all code points from it through 57 (a "9") are ASCII hex
digits. Code points 58 through 64 aren't, but 65 (an "A") through 70 (an "F")
are, as are 97 ("a") through 102 ("f"). 103 starts a range of code points
that aren't ASCII hex digits. That range extends to infinity, which on your
computer can be found in the variable C<$Unicode::UCD::MAX_CP>.
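Because the elements are sorted range boundaries, a single code point can be tested without expanding anything. The following is only a sketch: C<in_invlist> is a hypothetical helper, not part of the proposed API, and it uses the ASCII_Hex_Digit=Yes list quoted above as a literal so that it runs on its own. It counts how many boundaries are less than or equal to the code point; an odd count means the code point falls in a range that has the property-value.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the proposed API): true if $cp is in
# the set described by the inversion list referenced by $invlist.
sub in_invlist {
    my ($invlist, $cp) = @_;

    # Binary search for the number of elements that are <= $cp.
    my ($lo, $hi) = (0, scalar @$invlist);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($invlist->[$mid] <= $cp) {
            $lo = $mid + 1;
        }
        else {
            $hi = $mid;
        }
    }

    # An odd count means the last boundary at or below $cp began a
    # range that has the property-value.
    return $lo % 2 == 1;
}

# The ASCII_Hex_Digit=Yes list from the example above.
my @hex = (48, 58, 65, 71, 97, 103);

for my $char ("0", "9", ":", "A", "F", "G", "f", "g") {
    printf "%s (%d): %s\n", $char, ord $char,
        in_invlist(\@hex, ord $char) ? "yes" : "no";
}
```

An empty list (an unknown property) yields a count of zero, so every code point tests false.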
It is a simple matter to expand out an inversion list to a full list of all
code points that have the property-value:
 my @invlist = prop_invlist("My Property");
 die "empty" unless @invlist;
 my @full_list;
 for (my $i = 0; $i < @invlist; $i += 2) {
     my $upper = ($i + 1) < @invlist
                 ? $invlist[$i+1] - 1      # In range
                 : $Unicode::UCD::MAX_CP;  # To infinity. You may want
                                           # to stop earlier
     for my $j ($invlist[$i] .. $upper) {
         push @full_list, $j;
     }
 }
=cut
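For anyone who wants to experiment without building the full list, the same loop can stop at inclusive start/end pairs, which avoids materializing a huge final range. This is a minimal sketch under stated assumptions: C<invlist_to_ranges> is a hypothetical name, the inversion lists are the literal ones from the examples above, and 0x10FFFF (the highest Unicode code point) stands in for C<$Unicode::UCD::MAX_CP> as the cap on an open-ended final range.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an inversion list into [start, end] pairs,
# both ends inclusive. $max_cp caps a final range that has no end.
sub invlist_to_ranges {
    my ($invlist, $max_cp) = @_;
    my @ranges;
    for (my $i = 0; $i < @$invlist; $i += 2) {
        my $end = $i + 1 < @$invlist
                ? $invlist->[$i + 1] - 1   # Closed by the next element
                : $max_cp;                 # Open-ended final range
        push @ranges, [ $invlist->[$i], $end ];
    }
    return @ranges;
}

# ASCII_Hex_Digit=Yes, as quoted in the examples above.
my @ranges = invlist_to_ranges([48, 58, 65, 71, 97, 103], 0x10FFFF);
printf "%d..%d\n", @$_ for @ranges;
# prints:
# 48..57
# 65..70
# 97..102
```

With the ASCII_Hex_Digit=No list, which has an odd number of elements, the last pair runs from 103 to whatever cap is passed in.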