Re: Character classes with Unicode

On Fri, Feb 15, 2002 at 01:21:33PM -0500, John A.Walsh wrote:

Hello,

I can't get character classes in regular expressions to work with
Unicode characters.  I've tried putting both the literal Unicode
characters and the \x{XX} notation within square brackets [] to create
a character class, but it's not working.  I've tried with both the
developer release of Perl 5.7.2 and the daily build from 2002/02/13.

Here's an example of some code that isn't working for me:
---
#!/usr/local/bin/perl5.7.2
use Encode;
use utf8;


Rule #1: Do not use "use utf8".  It's irrelevant.

        Amendment: "use utf8" is useful in one case and one case only--
        if your *script* is in UTF-8, you can say "use utf8" and then
        use UTF-8 in places like variable and subroutine names.

        (Now I'm talking Perl 5.7.  In Perl 5.6 it was different.)

$string = encode_utf8("f\x{e9}lise");


encode_utf8() will correctly transform the \x{e9} into the UTF-8 bytes
\x{c3}\x{a9}.
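
To see that transformation directly, here is a minimal sketch that dumps
the string before and after encoding.  The hexdump() helper is made up
for illustration, and the output shown is what a current Perl with the
Encode module should print; details may differ on 5.7.2.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Character string: U+00E9 is a single element here.
my $chars = "f\x{e9}lise";
# Byte string: the same text encoded as UTF-8.
my $bytes = encode_utf8($chars);

# Hypothetical helper: render each element of a string as two hex digits.
sub hexdump { return join ' ', map { sprintf '%02x', ord($_) } split //, $_[0] }

printf "chars: %d elements: %s\n", length($chars), hexdump($chars);
printf "bytes: %d elements: %s\n", length($bytes), hexdump($bytes);
# chars: 6 elements: 66 e9 6c 69 73 65
# bytes: 7 elements: 66 c3 a9 6c 69 73 65
```

The encoded string is one element longer, because the single \x{e9} has
become the pair \x{c3}\x{a9}.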

$string =~ s/f[e\x{e8}\x{e9}\x{ea}\x{eb}]lise/SUCCESS/; #does not match


It does not, because you no longer have the byte \x{e9} in your $string;
you have its UTF-8 bytes \x{c3}\x{a9} instead.

print "new string: $string\n";
---

With another approach, this works:

#!/usr/local/bin/perl5.7.2
use Encode;
use utf8;

$string = encode_utf8("f\x{e9}lise");
$regex = encode_utf8("f\x{e9}lise");
$string =~ s/$regex/SUCCESS/; #matches


This works because now the byte sequences match.

print "new string: $string\n";

While this does not:

#!/usr/local/bin/perl5.7.2
use Encode;
use utf8;

$string = encode_utf8("f\x{e9}lise");
$regex = encode_utf8("f[\x{e9}\x{e8}]lise");


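You shouldn't convert regular expressions with encode_utf8().
What happens now is that the character class in the $regex
ends up containing three distinct bytes: \x{c3} (which appears twice),
\x{a9}, and \x{a8}.  A character class matches exactly one of them,
while the $string has the two bytes \x{c3}\x{a9} in that position.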
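
A small sketch of that failure, assuming a current Perl with Encode.
The variable names $matches_real and $matches_bogus are only for
illustration.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $string = encode_utf8("f\x{e9}lise");          # bytes 66 c3 a9 6c 69 73 65
my $regex  = encode_utf8("f[\x{e9}\x{e8}]lise");  # class holds bytes c3, a9, a8

# The class consumes exactly ONE byte, but the encoded e-acute is TWO
# bytes, so "lise" never lines up and the match fails.
my $matches_real = ($string =~ /$regex/) ? 1 : 0;

# A lone \x{c3} between "f" and "lise" does match, which is clearly
# not what the pattern was meant to accept.
my $matches_bogus = ("f\x{c3}lise" =~ /$regex/) ? 1 : 0;

print "encoded string matches: $matches_real\n";
print "lone c3 byte matches:   $matches_bogus\n";
```

So the encoded class both rejects the string you wanted and accepts
byte sequences that are not valid UTF-8 at all.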

$string =~ s/$regex/SUCCESS/; #does not match
print "new string: $string\n";

Should examples 1 and 3 be working?  Thanks for listening.


In all three examples you weren't actually using Unicode from
Perl's perspective.  You were converting 8-bit encoding bytes
to UTF-8 bytes.
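
For contrast, here is a sketch of the character-based approach: decode
incoming UTF-8 bytes once, match against characters, and encode only on
output.  The $input value is an assumed stand-in for data read from a
file or socket, and this is written against a current Perl with Encode
rather than tested on 5.7.2.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Assumed input: UTF-8 bytes as they might arrive from outside the program.
my $input = "f\xc3\xa9lise";

# Decode once at the boundary, so e-acute becomes the single character U+00E9.
my $string = decode_utf8($input);

# The class now compares characters, so U+00E9 matches \x{e9}.
$string =~ s/f[e\x{e8}\x{e9}\x{ea}\x{eb}]lise/SUCCESS/;

# Encode again only when writing the result out.
print "new string: ", encode_utf8($string), "\n";
```

The regular expression itself is never passed through encode_utf8();
only the data crossing the program boundary is converted.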

You can take a peek at "perluniintro", which is a new document
(after 5.7.2), hopefully clarifying things a bit.

http://www.iki.fi/jhi/perluniintro.pod

Some of the features it talks about only work in post-5.7.2 Perl, but
most of the 'theory' should be applicable to 5.7.2.

John
| John A. Walsh, Manager, Electronic Text Technologies
| Digital Library Program / University Information Technology Services (UITS)
| Indiana University, 1320 East Tenth Street, Bloomington, IN 47405
| Voice:812-855-8758 Fax:812-856-2062 <mailto:jawalsh(_at_)indiana(_dot_)edu>


-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen