perl-unicode

/\w/ match with 'use locale' misses letters in utf8 locale

2008-07-10 23:11:07
Hello. Should /\w/ match non-ASCII letters under 'use locale' when the
environment is set correctly?

The problem is that on Linux (I've tried Gentoo and Debian) /\w/ does
not match Russian letters, even though I use locale and LC_COLLATE is set
to ru_RU.UTF-8. The strangest thing is that on FreeBSD this works. Look:

+++++++++++++++++++++++++ FreeBSD ++++++++++++++++++++++++++++++++
FreeBSD $ cat test-file
слово
строка с пробелами
string with spaces (not only with [:alnum:])
English;
hello_привет

FreeBSD $ perl -e 'open(IN, "< test-file"); while(<IN>) { print if /\w/; }'
string with spaces (not only with [:alnum:])
English;
hello_привет
FreeBSD $ perl -e 'use locale; open(IN, "< test-file"); while(<IN>) { print if /\w/; }'
слово
строка с пробелами
string with spaces (not only with [:alnum:])
English;
hello_привет
FreeBSD $
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++ Linux +++++++++++++++++++++++++++++++
Linux $ perl -e 'use locale; open(IN, "< test-file"); while(<IN>) { print if /\w/; }'
string with spaces (not only with [:alnum:])
English;
hello_привет
Linux $
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

locale -a shows that the ru_RU.utf8 locale exists on both systems, and
I've tried setting LANG and LC_ALL to this value, with no result. Do I
understand correctly that we should always supply the encoding of
streams? If so, why does this work on FreeBSD without supplying any
encoding, and is it possible (and a good idea) to do the same on Linux?
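
Supplying the stream encoding explicitly would look roughly like the
sketch below. It assumes the input file is UTF-8 encoded, and the helper
name word_lines is made up for illustration. Input is decoded through a
PerlIO layer instead of relying on 'use locale', so /\w/ sees decoded
characters rather than raw bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines read from $fh that contain
# at least one word character.
sub word_lines {
    my ($fh) = @_;
    my @matched;
    while (my $line = <$fh>) {
        push @matched, $line if $line =~ /\w/;
    }
    return @matched;
}

# Encode output as UTF-8 so the decoded characters print correctly.
binmode(STDOUT, ':encoding(UTF-8)');

# Each file named on the command line is assumed to be UTF-8.
for my $file (@ARGV) {
    open(my $in, '<:encoding(UTF-8)', $file)
        or die "Cannot open $file: $!";
    print word_lines($in);
    close($in);
}
```

Saved under a name of your choice and run with test-file as its only
argument, this should not depend on the LANG or LC_* settings, because
the decoding is requested explicitly in the open() call.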

Thank you for your time.
-- 
Peter.
