perl-i18n

Unicode support & regexp character classes

2002-10-30 06:46:05

hi

we're considering migrating from Perl v5.005_03 to v5.8.0. one of the new
features we're interested in is the Unicode support in v5.8.0. i've been
experimenting a bit with the new Unicode support and regexps but i'm a bit
confused about how this functionality should be used.

i'd like to be able to match iso-8859-15 and Unicode characters with the
regexp character classes such as \w. when exactly is a string searched as
a sequence of Unicode characters? according to perlretut a string is
searched as a Unicode string if the regexp contains Unicode characters. is
this the only case? for example, the following statement seems to suggest
that matching is also done in Unicode mode if the string is a Unicode
string but the regexp is not:

print "match\n" if ("\x{0185}\x{0227}\x{0213}" =~ m/^([\w]{3})$/);

or does this only match the first three bytes of the UTF-8 string, not the
whole three characters?
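
one way to check this is to compare character and octet counts for the same
string. the following is only a sketch: the variable names are made up for
illustration, and it assumes a perl where Encode is available (v5.8.0 ships
with it).

```perl
use strict;
use warnings;
use Encode ();

# three code points above 0xFF, so perl has to treat this as a character string
my $s = "\x{0185}\x{0227}\x{0213}";

# length() counts characters for such a string
my $chars = length($s);

# the same text as UTF-8 octets, for comparison (two octets per character here)
my $octet_count = length(Encode::encode_utf8($s));

# if \w{3} consumed only three octets, the anchored match could not succeed
my $matched      = ($s =~ m/^(\w{3})$/) ? 1 : 0;
my $captured_len = $matched ? length($1) : 0;

print "chars=$chars octets=$octet_count matched=$matched captured=$captured_len\n";
```

since the pattern is anchored with ^ and $, a successful match means \w{3}
covered the whole string, i.e. three characters rather than three bytes.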

another issue is how to encode a string from the platform's native encoding
to UTF-8. i'm trying to use Encode::encode_utf8($s) for converting strings
from native to UTF-8. this function doesn't, however, set the UTF-8 flag
for the encoded string, so matching the string with a UTF-8 regexp fails
(this is mentioned in perldoc Encode). reading the Encode man page a bit
more i get the impression that Encode::decode("iso-8859-15", $s) would
convert a string from iso-8859-15 into Perl's internal form, which
according to 'perldoc perluniintro' is the native eight-bit character set
for code points less than 0xff. however, inspecting the string with
is_utf8() implies that the string is in UTF-8 form, as does the following
regexp:

print "match\n" if (decode("iso-8859-15",'öäå') =~ m/^\x{f6}\x{e4}\x{e5}$/);
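
the difference between the two calls can be made visible with a small test
script. this is a sketch, not a definitive answer: the input is written with
\x escapes so that it doesn't depend on the encoding of the source file, and
the byte-string result assumes no locale or unicode_strings feature is in
effect.

```perl
use strict;
use warnings;
use Encode ();

# the octets 0xF6 0xE4 0xE5 are "öäå" in iso-8859-15
my $octets = "\xf6\xe4\xe5";

# decode(): octets in a known encoding -> perl character string
my $decoded = Encode::decode("iso-8859-15", $octets);

# encode_utf8(): character string -> UTF-8 octets (a byte string again)
my $encoded = Encode::encode_utf8($decoded);

my $decoded_flag = Encode::is_utf8($decoded) ? 1 : 0;
my $encoded_flag = Encode::is_utf8($encoded) ? 1 : 0;

# the decoded string matches by code point and with \w
my $codepoint_match = ($decoded =~ m/^\x{f6}\x{e4}\x{e5}$/) ? 1 : 0;
my $word_match      = ($decoded =~ m/^\w{3}$/) ? 1 : 0;

# the undecoded octets get byte semantics, where 0xF6 etc. are not \w
my $raw_word_match  = ($octets =~ m/^\w{3}$/) ? 1 : 0;

printf "decoded: length=%d flag=%d\n", length($decoded), $decoded_flag;
printf "encoded: length=%d flag=%d\n", length($encoded), $encoded_flag;
print  "codepoint=$codepoint_match word=$word_match raw_word=$raw_word_match\n";
```

the decoded string has three characters, while the encode_utf8() result has
six octets with the flag off, which would explain why it fails to match a
pattern written in terms of characters.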

which method should be used for encoding strings into UTF-8 so that they
will work correctly with regexps? encode() + _utf8_on() or some other
method?
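
one possible approach, as far as i can tell from the Encode man page, would
be to decode() on input, match against the decoded characters, and encode()
only on output, rather than encode() + _utf8_on(). a minimal sketch, where
the sample text and variable names are just made up for illustration:

```perl
use strict;
use warnings;
use Encode ();

# iso-8859-15 input octets; 0xA4 is the euro sign in this encoding
my $input = "sm\xf6rg\xe5sbord \xa4 100";

# input boundary: octets -> characters
my $text = Encode::decode("iso-8859-15", $input);

# all matching is done on characters, so \w sees the accented letters
my @words = $text =~ m/(\w+)/g;

# the euro sign decodes to U+20AC, which is a symbol and not a \w character
my $euro = ord(substr($text, 12, 1));

# output boundary: characters -> UTF-8 octets
my $output = Encode::encode("UTF-8", $text);

printf "words=%d first_word_length=%d\n", scalar(@words), length($words[0]);
printf "euro=U+%04X chars=%d utf8_octets=%d\n", $euro, length($text), length($output);
```

with this arrangement the regexps never see encoded octets, so the UTF-8 flag
doesn't have to be set by hand.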

best regards,
--
        aspa


  • Unicode support & regexp character classes, Marko Asplund <=