Re: /\w/ match with 'use locale' misses letters in utf8 locale

Peter Volkov wrote 2008-07-11 10:10 (+0400):

The problem is that on Linux (I've tried Gentoo and Debian) /\w/ does
not match Russian letters even though I use locale and LC_COLLATE is set
to ru_RU.UTF-8.


\w should match Cyrillic letters even without "use locale". You might be
running into an annoying bug that makes \w lose its Unicode support
depending on the *internal* representation of a value. To work around
this bug, read the Unicode::Semantics documentation on CPAN, then use
either that module or utf8::upgrade.

Linux $ perl -e 'use locale; open(IN, "< test-file"); while(<IN>) { print if /\w/; }'
string with spaces (not only with [:alnum:])
English;
hello_привет


Apart from that, there's a slightly more important issue here. You're
opening a text file without specifying its character encoding.
Likewise, you need to specify the encoding for output.

Assuming utf8 for both:

    perl -le'
        # Encode output as UTF-8 and decode the input file as UTF-8.
        binmode STDOUT, ":encoding(utf8)";
        open my $in, "< :encoding(utf8)", "test-file"
            or die "Cannot open test-file: $!";
        while (<$in>) {
            print "match: [$1]" if /(\w+)/;
        }
    '

Which on my system prints:

    match: [слово]
    match: [строка]
    match: [string]
    match: [English]
    match: [hello_привет]

I'm not sufficiently familiar with "use encoding" to say anything about
it, but you shouldn't need it just for this.

Do I understand correctly that we should always supply the encoding of
streams?


Yes.
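
If repeating the layer on every handle gets tedious, one possible way to
state it once is the "open" pragma. This is only a sketch, assuming UTF-8
for every stream; the function name match_words is made up for
illustration, and its argument is the path of the file to read.

```shell
match_words() {
    # $1: path to a UTF-8 encoded text file
    perl -le '
        # Default layer for handles opened in this scope; :std also
        # applies it to STDIN, STDOUT and STDERR.
        use open qw(:std :encoding(UTF-8));
        open my $in, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
        while (<$in>) {
            print "match: [$1]" if /(\w+)/;
        }
    ' "$1"
}
```

Calling "match_words test-file" should then behave like the explicit
version with binmode and an encoding layer in the open mode.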

If yes, why does this work on FreeBSD without supplying any encoding,
and is it possible (and a good idea) to do the same on Linux?


I have no idea.
-- 
Met vriendelijke groet,  Kind regards,  Korajn salutojn,

  Juerd Waalboer:  Perl hacker  <#####(_at_)juerd(_dot_)nl>  <http://juerd.nl/sig>
  Convolution:     ICT solutions and consultancy <sales(_at_)convolution(_dot_)nl>
1;