perl-unicode

Re: Problem using Unicode::UCD casefold and casespec.

2003-05-10 19:30:04

On Sat, 10 May 2003 22:45:52 +0200
terry(_at_)eatoni(_dot_)com (terry jones) wrote:

and I put these into a short script:

    #!/usr/bin/perl -w
    use strict;
    use Unicode::UCD qw(casespec casefold charinfo);

    foreach my $cp (qw(00DF 0130 0149 01F0 0390 03B0 0587 1E96 1E97 1E98)) {
      my $info = charinfo(hex($cp));
      die "$0: $cp has no charinfo.\n" unless defined $info;

      printf "U+$cp: %-53s fold=%d, spec=%d\n",
             $info->{name},
             defined casefold($cp) ? 1 : 0,
             defined casespec($cp) ? 1 : 0;
    }


and expected to see that casefold (at least) gave a defined value for
each. But instead I see the following output:


U+00DF: LATIN SMALL LETTER SHARP S                            fold=1, spec=1
U+0130: LATIN CAPITAL LETTER I WITH DOT ABOVE                 fold=0, spec=0
U+0149: LATIN SMALL LETTER N PRECEDED BY APOSTROPHE           fold=0, spec=0
U+01F0: LATIN SMALL LETTER J WITH CARON                       fold=1, spec=1
U+0390: GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS      fold=0, spec=0
U+03B0: GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS   fold=1, spec=1
U+0587: ARMENIAN SMALL LIGATURE ECH YIWN                      fold=0, spec=0
U+1E96: LATIN SMALL LETTER H WITH LINE BELOW                  fold=1, spec=1
U+1E97: LATIN SMALL LETTER T WITH DIAERESIS                   fold=1, spec=1
U+1E98: LATIN SMALL LETTER W WITH RING ABOVE                  fold=1, spec=1


Is there anything wrong here? If not, I guess there's something pretty
fundamental going on here that I don't understand. Why would U+00DF have
folding information but U+0149 not have it?

'0130', '0149', '0390' and '0587' match /^\d+$/, so they are taken as
decimal numbers; the others don't match and are taken as hexadecimal.
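
The split can be checked without calling Unicode::UCD at all. This is
only a minimal sketch: looks_decimal() is a hypothetical helper that
applies the /^\d+$/ test mentioned above, not a function of the module,
and other versions of the module may apply a different rule.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::UCD): true if the string
# consists only of decimal digits, i.e. it matches /^\d+$/.
sub looks_decimal { return $_[0] =~ /^\d+$/ ? 1 : 0 }

foreach my $cp (qw(00DF 0130 0149 01F0 0390 03B0 0587 1E96 1E97 1E98)) {
    if (looks_decimal($cp)) {
        # Read as decimal, "0130" means code point 130, which is U+0082.
        printf "%s: all digits, taken as decimal %d (U+%04X)\n",
               $cp, $cp, $cp;
    } else {
        # A letter A-F rules out decimal, so the string is read as hex.
        printf "%s: contains A-F, taken as hex (U+%04X)\n",
               $cp, hex($cp);
    }
}
```

Under that assumption the four all-digit strings end up pointing at
unrelated low code points, which would explain the undefined results.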

Try hex($cp), "U+$cp" or "0x$cp" instead of $cp.

cf.
http://www.perldoc.com/perl5.8.0/lib/Unicode/UCD.html#Code-Point-Arguments
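
For illustration, here is one way the loop could be rewritten so that
every code point is converted with hex() before it reaches casefold()
and casespec(). It is a sketch rather than a tested patch; case_flags()
is a hypothetical helper name introduced only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::UCD qw(casespec casefold charinfo);

# Hypothetical helper: takes a code point as a hex string ("0130") and
# reports whether casefold() and casespec() return data for it.
sub case_flags {
    my ($hexstr) = @_;
    my $code = hex($hexstr);   # numeric value, nothing left to guess
    return (defined(casefold($code)) ? 1 : 0,
            defined(casespec($code)) ? 1 : 0);
}

foreach my $cp (qw(00DF 0130 0149 01F0 0390 03B0 0587 1E96 1E97 1E98)) {
    my $info = charinfo(hex($cp));
    die "$0: $cp has no charinfo.\n" unless defined $info;

    my ($fold, $spec) = case_flags($cp);
    printf "U+%s: %-53s fold=%d, spec=%d\n",
           $cp, $info->{name}, $fold, $spec;
}
```

Passing "U+$cp" or "0x$cp" as a string should work equally well;
hex() just makes the conversion explicit in the script itself.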

I don't think this behavior of _getcode() is consistent enough,
though.

SADAHIRO Tomoyuki