perl-unicode

Re: Two Unicode Support Issues

1999-12-10 12:26:41
Daniel Yacob writes:
: Greetings All,
: 
: I encountered a case where utf8 did not work as expected and thought I
: should report it here.  The problem occurred with the 5_62 development
: release:
: 
: 
: #!/usr/bin/perl
: 
: use utf8;
: 
: foreach $i (a..b) {
:   print "$i\n";
: }
: 
: __END__
: 
: 
: the above worked fine of course, it is when I changed 'a' to 0x1200 and
: 'b' to 0x137C (in utf8 form) that perl spat out some "bad character
: error".  In other contexts I encountered no problems.

We can certainly make ranges work on Unicode characters.  The question
arises, though, of how we treat ranges analogous to "aa".."zz".  If we
made a rule that derives a-to-z-ness from the consistency of the
character properties, then your range above would still break, because
U+1200 .. U+135A are Lo, but U+135B .. U+1360 are undefined, and
U+1361 is Po.

We probably need some kind of IsRangeStopper property that can be
set on the fly.
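
To make the property-consistency idea concrete, here is a minimal sketch
that builds a character range from code points and keeps only the
letters.  The helper name letter_range is hypothetical, and filtering on
\p{L} is just one possible rule for this illustration, not a description
of how the range operator behaves:

```perl
use strict;
use warnings;

# Hypothetical helper: return the characters between two code points
# (inclusive) whose General_Category is some kind of Letter.
sub letter_range {
    my ($from, $to) = @_;
    return grep { /\p{L}/ } map { chr($_) } $from .. $to;
}

# Undefined positions and punctuation such as U+1361 are skipped.
my @ethiopic_letters = letter_range(0x1200, 0x137C);
```

With a filter like this the undefined positions simply drop out, instead
of stopping the range.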

: The next issue I encountered when using \p{InEthiopic} which gives a
: positive response for anything in the range 0x1200 - 0x137F.  While this
: is valid for the "Ethiopic Range" in Unicode not everything in the range
: is valid Ethiopic.  There are a number of undefined positions in the
: field, around 37 or so that I had wished to avoid.
: 
: I was led to modify the In/Ethiopic.pl script to step around the
: undefined characters.

That's the wrong place to modify it.  That file is generated, and when
it gets regenerated your changes will be clobbered.

: What is the policy here?  What was the original
: intention of the "In" property?  I think this problem must come up often
: with other scripts.

The intent of the "In" properties is only to reflect the contents of
the lib/unicode/Blocks.txt that comes from the Unicode Consortium.

I think definedness would be an "Is" rather than an "In".

Or maybe we don't actually need an IsDefined, since if we had a way
of getting at the lib/unicode/Category.pl table, it would naturally
serve that purpose.  (Since all character properties must be "true"
in the Perl sense.  I doubt they're going to come up with a property
name of "0".)

Whatever, you can always write a method to define your character
properties.  You don't have to rely on the tables in lib/unicode.
Something like this:

    sub MyEthiopic {
        return <<'END';
    1200        135A
    1361        137C
    END
    }

    /\p{MyEthiopic}/;
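
For anyone trying this on a later perl, here is a self-contained sketch
of the same idea.  It assumes the rule documented in perlunicode that a
user-defined property sub must have a name beginning with "In" or "Is",
so the hypothetical name InMyEthiopic stands in for MyEthiopic; the
helper is_my_ethiopic is likewise made up for the illustration:

```perl
use strict;
use warnings;

# User-defined property: each line is a hex start and hex end of a
# range, separated by a tab.  The ranges are the ones given above.
sub InMyEthiopic {
    return "1200\t135A\n"
         . "1361\t137C\n";
}

# Hypothetical helper: true if the single character is in the property.
sub is_my_ethiopic {
    my ($char) = @_;
    return $char =~ /\A\p{InMyEthiopic}\z/ ? 1 : 0;
}

# U+1200 and U+1361 fall inside the ranges; U+135B and "a" do not.
my @hits = map { is_my_ethiopic($_) } ("\x{1200}", "\x{135B}", "\x{1361}", "a");
```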

Or we could even write a generalized sub generator that goes through
the tables and combines InWhatever with Category info to produce
character sets on the fly:

    *MyEthiopic = gen_property_sub("InEthiopic and IsPrint");

    /\p{MyEthiopic}/;
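
A rough sketch of what such a generator might look like on a later perl
follows.  Everything here is an assumption for illustration: it only
understands terms joined by "and", it leans on the "+utf8::Name" and
"&utf8::Name" (intersection) line syntax that perlunicode documents for
user-defined properties, and the name InMyEthiopicPrint is hypothetical,
chosen to satisfy the "In"/"Is" naming rule:

```perl
use strict;
use warnings;

# Hypothetical generator: turn "A and B and C" into a property sub whose
# body includes A and then intersects with each remaining term.
sub gen_property_sub {
    my ($spec) = @_;
    my @terms = split /\s+and\s+/, $spec;
    my $first = shift @terms;
    my $body  = join '', "+utf8::$first\n", map("&utf8::$_\n", @terms);
    return sub { $body };
}

# Install the generated sub at compile time so the regex below sees it.
BEGIN {
    no warnings 'once';
    *InMyEthiopicPrint = gen_property_sub("InEthiopic and IsPrint");
}

# Hypothetical helper: true if the single character is in the property.
sub is_printable_ethiopic {
    my ($char) = @_;
    return $char =~ /\A\p{InMyEthiopicPrint}\z/ ? 1 : 0;
}
```

Which positions come out as printable depends on the Unicode tables
shipped with the perl in use.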

Note to implementors:  It occurs to me that, despite the fact that the
MyEthiopic method will be looked up in the caller's package, once it is
defined by SWASHNEW, it's defined globally.  This strikes me as
potentially problematic.  We might need a convention that when people
say /\p{foo}/ they mean /\p{MyPackage::foo}/ if it's defined in
MyPackage (or pretended to be by derivation?), and otherwise back off
to the global definition.  As long as the disambiguation code is in
SWASHNEW it shouldn't impact performance.

Of course, my brain could be in sideways this morning...

Larry
