perl-unicode

5.8.1 perlre man page: [:punct:] vs. \p{IsPunct}

2003-11-02 10:30:04

I just happened to notice that the perlre man page describes the 
POSIX "[:punct:]" character class as being equivalent to the Unicode 
"\p{IsPunct}" character class.

I haven't tried to track down the respective standards documents for
POSIX and Unicode to see whether these classes are _supposed_ to be
equivalent over the printable ASCII character set, but when I test them
in Perl 5.8.1, they are _not_ equivalent, as the following snippet will
demonstrate:

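use strict;
use warnings;

# Test every printable ASCII character against both classes.
for my $x ( 0x20 .. 0x7e ) {
    my $ch  = chr( $x );
    my $res = ( $ch =~ /[[:punct:]]/ ) ? "matches  :punct:" : "is not a :punct:";
    $res .= ( $ch =~ /\p{IsPunct}/ ) ? " matches  {IsPunct}" : " fails on {IsPunct}";
    # Report any character that matches at least one of the two classes.
    printf( " 0x%x (%3d.) %s %s\n", $x, $x, $ch, $res ) if ( $res =~ /matches/ );
}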

The differences involve these nine characters:  $ + < = > ^ ` | ~
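
To isolate just the disagreements rather than reading them out of the full listing, something like the following sketch could be used. The helper name punct_differences is made up for illustration. It assumes the same two patterns as above and returns only the characters on which they give different answers.

```perl
use strict;
use warnings;

# Hypothetical helper: return the printable ASCII characters on which
# [[:punct:]] and \p{IsPunct} give different answers.
sub punct_differences {
    my @diff;
    for my $code ( 0x20 .. 0x7e ) {
        my $ch    = chr $code;
        my $posix = ( $ch =~ /[[:punct:]]/ ) ? 1 : 0;
        my $uni   = ( $ch =~ /\p{IsPunct}/ ) ? 1 : 0;
        push @diff, $ch if $posix != $uni;
    }
    return @diff;
}

# Print the disagreeing characters in code-point order, space-separated.
print join( ' ', punct_differences() ), "\n";
```

If the two classes behave as described here, the list should contain only the symbol characters and none of the ordinary punctuation marks.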

Except for the back-tick (`), I wouldn't be surprised if POSIX and 
Unicode are supposed to differ on these points, so maybe it's just a 
matter of fixing the perlre man page.  (I'm not sure yet what the 
behavior of [:punct:] is supposed to be on non-ASCII punctuation 
characters in Unicode -- maybe the man page should clarify this too.)
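
For the non-ASCII question, a small probe along these lines might help. It is only a sketch: the helper name classify and the four sample code points (inverted exclamation mark, inverted question mark, em dash, left double quotation mark) are my own picks for illustration. I am deliberately not claiming what [:punct:] will report for them, since that may depend on the Perl version and on how the string is stored internally.

```perl
use strict;
use warnings;

# Hypothetical helper: report how each class treats a single code point.
sub classify {
    my ($code) = @_;
    my $ch = chr $code;
    return {
        posix   => ( $ch =~ /[[:punct:]]/ ) ? 1 : 0,
        unicode => ( $ch =~ /\p{IsPunct}/ ) ? 1 : 0,
    };
}

# Sample non-ASCII punctuation, chosen only for illustration.
for my $code ( 0xA1, 0xBF, 0x2014, 0x201C ) {
    my $r = classify($code);
    # Print the code point rather than the character itself, to avoid
    # "Wide character" warnings on an unconfigured output handle.
    printf( "U+%04X  :punct: %d  {IsPunct} %d\n",
        $code, $r->{posix}, $r->{unicode} );
}
```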

        Dave Graff

