perl-unicode

Re: Problem using Unicode::UCD casefold and casespec.

2003-05-10 14:30:04

"Jarkko" == Jarkko Hietaniemi <jhi(_at_)iki(_dot_)fi> writes:
Terry> However the problem with this is that $casefold comes back as undef.

Jarkko> Because the U+09DC has no 'special folding' (nor 'special
Jarkko> casing', for that matter).  It has only the 'usual' cases
Jarkko> (which can be retrieved using charinfo()).

OK, thank you.

Another suggestion is therefore that the examples in the docs be
changed to use codepoints that do have special folding or casing.

Here's what I see now. I grepped the first ten code points with a full
case folding (status F) out of CaseFolding-3.2.0.txt:

    $ grep ' F;' CaseFolding-3.2.0.txt | head | cut -f1 -d\;
    00DF
    0130
    0149
    01F0
    0390
    03B0
    0587
    1E96
    1E97
    1E98

and I put these into a short script:

    #!/usr/bin/perl -w
    use strict;
    use Unicode::UCD qw(casespec casefold charinfo);

    foreach my $cp (qw(00DF 0130 0149 01F0 0390 03B0 0587 1E96 1E97 1E98)) {
        my $info = charinfo(hex($cp));
        die "$0: $cp has no charinfo.\n" unless defined $info;

        printf "U+$cp: %-53s fold=%d, spec=%d\n",
               $info->{name},
               defined casefold($cp) ? 1 : 0,
               defined casespec($cp) ? 1 : 0;
    }


and expected to see that casefold (at least) gave a defined value for
each. But instead I see the following output:


    U+00DF: LATIN SMALL LETTER SHARP S                            fold=1, spec=1
    U+0130: LATIN CAPITAL LETTER I WITH DOT ABOVE                 fold=0, spec=0
    U+0149: LATIN SMALL LETTER N PRECEDED BY APOSTROPHE           fold=0, spec=0
    U+01F0: LATIN SMALL LETTER J WITH CARON                       fold=1, spec=1
    U+0390: GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS      fold=0, spec=0
    U+03B0: GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS   fold=1, spec=1
    U+0587: ARMENIAN SMALL LIGATURE ECH YIWN                      fold=0, spec=0
    U+1E96: LATIN SMALL LETTER H WITH LINE BELOW                  fold=1, spec=1
    U+1E97: LATIN SMALL LETTER T WITH DIAERESIS                   fold=1, spec=1
    U+1E98: LATIN SMALL LETTER W WITH RING ABOVE                  fold=1, spec=1


Is there anything wrong here? If not, I guess there's something pretty
fundamental going on here that I don't understand. Why would U+00DF have
folding information but U+0149 not have it?
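
One pattern in the output may be relevant. The four code points reported
as fold=0 are exactly the ones whose hex spelling uses only the digits
0-9, and the script passes hex($cp) to charinfo but the bare string $cp
to casefold and casespec. A minimal sketch of that observation follows.
The guess that an all-digit string is read as decimal rather than hex is
an assumption on my part, not something checked against the module
source, and the names @digits_only and @with_letter are only
illustrative.

```perl
#!/usr/bin/perl -w
use strict;
use Unicode::UCD qw(casefold);

# Same ten code points as in the script above.
my @cps = qw(00DF 0130 0149 01F0 0390 03B0 0587 1E96 1E97 1E98);

# Split them by whether the hex spelling happens to use decimal digits only.
my @digits_only = grep {  /^\d+$/ } @cps;
my @with_letter = grep { !/^\d+$/ } @cps;

print "digits only: @digits_only\n";
print "with A-F:    @with_letter\n";

# Convert to a number first, as the script already does for charinfo,
# so the argument cannot be mistaken for a decimal value.
foreach my $cp (@cps) {
    printf "U+%s: fold=%d\n", $cp, defined casefold(hex($cp)) ? 1 : 0;
}
```

If that guess is right, converting with hex() before every call, as in
the last loop, should remove the ambiguity.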

Regards,
Terry.