perl-unicode

Re: Need: list of Unicode characters that have canonical decompositions.

2011-06-28 12:31:55
On 06/27/2011 08:04 PM, BobH wrote:
Karl Williamson wrote:

 > I'm presuming you need this not for a one-time only thing, but to be
 > able to run this program over and over.

Yes -- this is for a module that will be usable in a number of
situations. See
http://search.cpan.org/~bhallissy/Text-Unicode-Equivalents-0.05/.

The current implementation cheats by accessing unicore/Decomposition.pl
exactly the same way Unicode::UCD does.

 > You can always download UnicodeData.txt from the Unicode web site.

Yes I can -- and certainly have done for my personal use. But including
that file (or some derivative) in a general purpose module would mean
that it wouldn't necessarily have the same Unicode version as the Perl
installation into which my module might be installed. And besides, the
information I need is already in the Perl core -- though supposedly not
usable.

 > In a regular expression,
 > \p{Dt= can} (Decomposition_Type=Canonical) will match all characters
 > that you want.

Yes, I understand that I can test a character to see if it has a
particular decomposition, but I'm not sure I understand how to use a
regex to generate a complete list of characters with decompositions.

 > I'm thinking that 5.16 will have the stringification
 > of that regex include the list you want, but not in 5.14, and
 > stringification is not necessarily fixed either.
 >
 > I could easily write a new function for UCD that returns a list of
 > all code points that have a given property.

That is an interesting offer, and I think this should be given serious
consideration. I'm sure my little module isn't the only one that, as we
go into the future, would benefit from such a function.

Thanks for your reply, Karl.

Bob


If I did this, I would be tempted to have it return an inversion list instead of an array of every code point that matches the property. Such an array could have as many as 1,114,112 elements. The largest possible inversion list is potentially half that, but the largest one that matches a Unicode property has around 700 elements, and yours would have somewhat over 200. That compactness is why inversion lists are often used to represent Unicode properties.

An inversion list is an array.  An example is:
5, 101, 116, 120, ...

This represents 5..100, 116..119 ...

The 0th element gives the first code point that is in the property; the next element gives the first code point after that one that's not in the property, and so forth. Each succeeding element marks the beginning of a range that is or isn't in the property, flipping between the two each time.
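
To make the layout concrete, here is a minimal sketch of testing a single code point against an inversion list with a binary search. The name in_invlist is hypothetical, made up for illustration; it is not an existing Unicode::UCD function. The idea is to count how many elements are less than or equal to the code point: an odd count means the last boundary crossed started an "in" range.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::UCD): returns true if $cp falls
# inside one of the ranges described by the inversion list @$invlist.
sub in_invlist {
    my ($invlist, $cp) = @_;

    # Binary search for the number of elements that are <= $cp.
    my ($lo, $hi) = (0, scalar @$invlist);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($invlist->[$mid] <= $cp) {
            $lo = $mid + 1;
        }
        else {
            $hi = $mid;
        }
    }

    # An odd count means the most recent boundary opened an "in" range.
    return $lo % 2 == 1;
}

my @invlist = (5, 101, 116, 120);    # represents 5..100 and 116..119
for my $cp (4, 5, 100, 101, 118, 120) {
    printf "%3d: %s\n", $cp, in_invlist(\@invlist, $cp) ? "in" : "out";
}
```

With the example list this reports 5, 100 and 118 as "in" and 4, 101 and 120 as "out". The lookup needs only about log2(N) comparisons, so even a 700-element list takes around ten steps.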

It is a simple matter to convert an inversion list into a true array or hash of every code point that matches.
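
As a rough sketch of that conversion, the following walks the list two elements at a time and expands each pair into a range. The name invlist_to_array is hypothetical, and the $max parameter is my assumption for handling a list with an odd number of elements, where the final range is taken to run to the end of the code space.

```perl
use strict;
use warnings;

# Hypothetical helper: expands an inversion list into every code point it
# covers. $max caps a final open-ended range (a list with an odd number of
# elements); that handling is an assumption of this sketch.
sub invlist_to_array {
    my ($invlist, $max) = @_;
    $max = 0x10FFFF unless defined $max;

    my @code_points;
    for (my $i = 0; $i < @$invlist; $i += 2) {
        my $start = $invlist->[$i];
        # The next element is the first code point NOT in the range.
        my $end = $i + 1 < @$invlist ? $invlist->[$i + 1] - 1 : $max;
        push @code_points, $start .. $end;
    }
    return @code_points;
}

my @invlist   = (5, 101, 116, 120);          # 5..100 and 116..119
my @all       = invlist_to_array(\@invlist);
my %is_member = map { $_ => 1 } @all;        # hash form, for O(1) lookups

printf "%d code points, first %d, last %d\n", scalar @all, $all[0], $all[-1];
# prints "100 code points, first 5, last 119"
```

The caller pays for the full array or hash only when it actually asks for one.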

My question to you is: would that be acceptable, do you think? I hate to return an enormous array by default when the application doesn't really need it.