perl-unicode

Re: specifying character types

1999-06-30 13:52:07
Ed Batutis writes:
: I'm looking for info on a good way to specify character types in Perl
: beyond what is supported via the locale mechanism in place today. I'd
: like to use "symbolic" representations like \s, etc., but there aren't
: any that mean things like "Han characters" or "full-pitch digits" etc.
: I think that using hard-coded character code ranges is a bad thing over
: the longer run because it ties my Perl code to very specific
: implementations of character storage.
: 
: In the short run I can create a module with strings of hex sequences
: that mean "full-pitch digits" etc. in various encodings. Is there a
: better way?
: 
: Is anyone designing a long term solution? Is adding new escapes to Perl
: a good thing? Or is there something else in the works?

The utf8 module in recent development versions of Perl will handle this
sort of thing.  The utf8 pragma is just a way to tell Perl to use wide
characters.  It just so happens that the default encoding is Unicode,
but that's merely convention.  All the Unicode character properties are
defined by tables, and you could switch tables just by saying something
like

    use utf8 'big5';

That presumes someone has written tables for 'big5'.

If you're using an existing set of tables such as Unicode, you can add
additional tables by defining a method that returns ranges of
characters in (or not in) the character class.  All these tables,
predefined or not, are accessed within regular expressions via the new
\p{PropName} escape.  Here's an example.

    use utf8;
    sub IsRomanNumeral {
        return <<'END';
        0043
        0044
        0049
        004C
        004D
        0056
        0058
        0063
        0064
        0069
        006C
        006D
        0076
        0078
        2160    2182
    END
    }

    if (/^\p{IsRomanNumeral}*$/) { print "Possible Roman Numeral\n" }

For more about the utf8 pragma, install the latest Perl and say
"perldoc utf8".  For more about the tables that drive Unicode, look at
all the *.pl files in the lib/unicode subdirectory.  The files contain
subroutines that return ranges of characters as above.  The utf8 module
worries about efficiently translating these to bitmaps, so you don't
have to.

One thing that is not clear from examining these files is that
property classes may be defined in terms of one another.  So you could
define

    sub DiSp {
        return "+utf8::IsDigit\n+utf8::IsSpace\n";
    }

to get a property class that matches either a digit or a space.  The
existing classes such as \w, \D, \s, etc. are all constructed this
way internally.
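
As a minimal sketch of how such a combined class might be exercised,
assuming a Perl recent enough to require that user-defined property
names begin with "In" or "Is" (the name IsDigitOrSpace and the sample
strings are hypothetical, chosen only for illustration):

```perl
use strict;
use warnings;

# Hypothetical property name, not from the message above.
# Each line of the returned string adds (+) an existing property
# to the new class, so this matches a digit or a space.
sub IsDigitOrSpace {
    return "+utf8::IsDigit\n+utf8::IsSpace\n";
}

# Try the new class on two sample strings.
for my $s ("42 17", "4 2x") {
    my $verdict = $s =~ /^\p{IsDigitOrSpace}+$/ ? "matches" : "does not match";
    print "'$s' $verdict\n";
}
```

The sub lives in the same package as the regular expression that uses
it, which is where Perl looks for a user-defined property.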

: Sorry if I missed something obvious, I'm a newbie to Perl.

No need to apologize.  This stuff is obviously not obvious yet.

Larry
