On Fri, Feb 15, 2002 at 01:21:33PM -0500, John A. Walsh wrote:
Hello,
I can't get character classes in regular expressions to work with
Unicode characters. I've tried putting both the literal Unicode
characters and the \x{XX} notation within square brackets [] to create
a character class, but it's not working. I've tried with both the
developer release of Perl 5.7.2 and the daily build from 2002/02/13.
Here's an example of some code that isn't working for me:
---
#!/usr/local/bin/perl5.7.2
use Encode;
use utf8;
Rule #1: Do not use "use utf8". It's irrelevant.
Amendment: "use utf8" is useful in one case and one case only--
if your *script* is in UTF-8, you can say "use utf8" and then
use UTF-8 in places like variable and subroutine names.
(I'm talking about Perl 5.7 here. In Perl 5.6 it was different.)
$string = encode_utf8("f\x{e9}lise");
encode_utf8() will correctly transform the \x{e9} into the UTF-8 bytes
\x{c3}\x{a9}.
$string =~ s/f[e\x{e8}\x{e9}\x{ea}\x{eb}]lise/SUCCESS/; #does not match
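It does not match because you no longer have the byte \x{e9} in your
$string; you have its UTF-8 bytes \x{c3}\x{a9} instead. The character
class matches a single byte, and neither of those two bytes is in it.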
print "new string: $string\n";
---
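To make the mismatch visible, here is a minimal sketch that dumps what encode_utf8() produces and then tries the same character class against it. The variable names ($chars, $bytes, $hex, $class_matches) are only for illustration, and the output assumes a Perl with a working Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# A character string: f, U+00E9, l, i, s, e.
my $chars = "f\x{e9}lise";

# encode_utf8() turns the single character \x{e9} into two bytes.
my $bytes = encode_utf8($chars);

# Dump every byte of the encoded string as hex.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $bytes;
print "$hex\n";    # 66 c3 a9 6c 69 73 65

# The class matches exactly one element between "f" and "lise",
# but the encoded string has two bytes there (c3 a9).
my $class_matches = $bytes =~ /f[e\x{e8}\x{e9}\x{ea}\x{eb}]lise/ ? 1 : 0;
print "class matches encoded bytes: $class_matches\n";    # 0
```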
With another approach, this works:
#!/usr/local/bin/perl5.7.2
use Encode;
use utf8;
$string = encode_utf8("f\x{e9}lise");
$regex = encode_utf8("f\x{e9}lise");
$string =~ s/$regex/SUCCESS/; #matches
This works because now the byte sequences match.
print "new string: $string\n";
While this does not:
#!/usr/local/bin/perl5.7.2
use Encode;
use utf8;
$string = encode_utf8("f\x{e9}lise");
$regex = encode_utf8("f[\x{e9}\x{e8}]lise");
You shouldn't convert regular expressions with encode_utf8().
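What happens now is that the character class in $regex ends up
containing three distinct bytes: \x{c3} (twice), \x{a9}, and \x{a8}.
A class matches only one of them at a time, so it can never match
the two-byte sequence \x{c3}\x{a9} in $string.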
$string =~ s/$regex/SUCCESS/; #does not match
print "new string: $string\n";
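As a rough illustration of the point about the encoded class, the sketch below prints the bytes of the encoded regex and compares it with a byte-level alternation. The alternation is shown only to explain the behaviour, not as a recommendation; the names $regex_hex, $class_hit and $alt_hit are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $string = encode_utf8("f\x{e9}lise");            # bytes: f c3 a9 l i s e
my $regex  = encode_utf8("f[\x{e9}\x{e8}]lise");    # class holds c3 a9 c3 a8

# Show the bytes that ended up in the pattern, brackets included.
my $regex_hex = join ' ', map { sprintf '%02x', ord } split //, $regex;
print "$regex_hex\n";    # 66 5b c3 a9 c3 a8 5d 6c 69 73 65

# The class consumes one byte, so "f", one byte, "lise" cannot cover c3 a9.
my $class_hit = $string =~ /$regex/ ? 1 : 0;
print "encoded class matches: $class_hit\n";    # 0

# At the byte level only whole two-byte sequences can stand in for a character.
my $alt_hit = $string =~ /f(?:\xc3\xa9|\xc3\xa8)lise/ ? 1 : 0;
print "byte alternation matches: $alt_hit\n";   # 1
```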
Should examples 1 and 3 be working? Thanks for listening.
In all three examples you weren't actually using Unicode from
Perl's perspective. You were converting 8-bit encoding bytes
to UTF-8 bytes.
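For comparison, a minimal sketch of staying at the character level: no encode_utf8() on the string or the regex, and decode_utf8() only when the data arrives as UTF-8 bytes. The variable names and the byte literal in $incoming are assumptions made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Character string used directly: \x{e9} is one character here.
my $string = "f\x{e9}lise";
(my $result = $string) =~ s/f[e\x{e8}\x{e9}\x{ea}\x{eb}]lise/SUCCESS/;
print "new string: $result\n";      # new string: SUCCESS

# If the data arrives as UTF-8 bytes, decode it to characters first.
my $incoming = "f\xc3\xa9lise";     # hypothetical input bytes
my $decoded  = decode_utf8($incoming);
(my $result2 = $decoded) =~ s/f[e\x{e8}\x{e9}\x{ea}\x{eb}]lise/SUCCESS/;
print "new string: $result2\n";     # new string: SUCCESS
```

In both cases the character class sees a single character \x{e9}, which is why it matches.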
You can take a peek at "perluniintro", which is a new document
(after 5.7.2), hopefully clarifying things a bit.
http://www.iki.fi/jhi/perluniintro.pod
Some of the features it talks about only work in post-5.7.2 Perl, but
most of the 'theory' should be applicable to 5.7.2.
John
| John A. Walsh, Manager, Electronic Text Technologies
| Digital Library Program / University Information Technology Services (UITS)
| Indiana University, 1320 East Tenth Street, Bloomington, IN 47405
| Voice:812-855-8758 Fax:812-856-2062 <mailto:jawalsh(_at_)indiana(_dot_)edu>
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen