perl-unicode

Re: Need: list of Unicode characters that have canonical decompositions.

2011-06-27 21:04:25
Karl Williamson wrote:

> I'm presuming you need this not for a one-time only thing, but to be
> able to run this program over and over.

Yes -- this is for a module that will be usable in a number of situations. See http://search.cpan.org/~bhallissy/Text-Unicode-Equivalents-0.05/.

The current implementation cheats by accessing unicore/Decomposition.pl exactly the same way Unicode::UCD does.

> You can always download UnicodeData.txt from the Unicode web site.

Yes, I can -- and certainly have done for my personal use. But bundling that file (or some derivative) with a general-purpose module would mean that its Unicode version wouldn't necessarily match that of the Perl installation into which my module is installed. And besides, the information I need is already in the Perl core -- though supposedly not meant to be used directly.

> In a regular expression,
> \p{Dt= can} (Decomposition_Type=Canonical) will match all characters
> that you want.

Yes, I understand that I can test a single character to see whether it has a particular decomposition type, but I'm not sure I understand how to use a regex to generate a complete list of the characters that have canonical decompositions.
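
The only approach I can see is brute force: test every code point in turn against that property. A minimal sketch, assuming that a scan of the whole code space each time the list is needed is acceptable (the sub name canonical_decomposables is just something I made up for illustration, not an existing API):

```perl
use strict;
use warnings;
no warnings 'utf8';    # some perls warn about non-characters such as U+FFFE

# Hypothetical helper: returns every code point whose Decomposition_Type
# is Canonical, according to the Unicode tables of the running perl.
sub canonical_decomposables {
    my @found;
    for my $cp (0 .. 0x10FFFF) {
        # Surrogates are not characters and have no decomposition.
        next if $cp >= 0xD800 && $cp <= 0xDFFF;
        push @found, $cp if chr($cp) =~ /\p{Dt=Canonical}/;
    }
    return @found;
}

my @list = canonical_decomposables();
printf "%d code points have a canonical decomposition\n", scalar @list;
printf "first: U+%04X\n", $list[0];
```

This at least stays in step with whatever Unicode version the installed perl carries, but it means over a million regex matches per call, so the result would want caching.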

> I'm thinking that 5.16 will have the stringification
> of that regex include the list you want, but not in 5.14, and
> stringification is not necessarily fixed either.
>
> I could easily write a new function for UCD that returns a list of
> all code points that have a given property.

That is an interesting offer, and I think it deserves serious consideration. I'm sure my little module isn't the only one that would benefit from such a function in the future.

Thanks for your reply, Karl.

Bob