perl-unicode

problem with user-defined unicode character properties

2007-06-11 16:05:55
Hi.

I'm working with a colleague on a linguistic project, and we are both
puzzled as to why we can't get a simple user-defined Unicode
character property to work as expected.

We are trying to clean up a corpus of UTF-8-encoded texts, which contain
mainly Russian Cyrillic, by removing all the Latin text that's in them.
Of course, we need to keep the punctuation, spaces, etc.

If we use the built-in property InCyrillic, like this:

print s/[\P{InCyrillic}]//g;

we get only Cyrillic strings piled up against each other, with no
punctuation, no spaces and no Latin letters, which is as it should be.
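
To make that behaviour easy to reproduce, here is a minimal standalone sketch of the built-in case. The sample line and the variable names are made up for illustration, and binmode is used here in place of the -CS switch.

```perl
use strict;
use warnings;
use utf8;                                # the source below contains UTF-8 literals

binmode(STDOUT, ':encoding(UTF-8)');     # stands in for -CS on the command line

# Hypothetical sample line, not taken from the corpus.
my $sample = "Привет, world! Это тест 42.";

# Delete every character that lies outside the Cyrillic block.
(my $cyrillic_only = $sample) =~ s/\P{InCyrillic}//g;

print "$cyrillic_only\n";                # prints "ПриветЭтотест"
```

Spaces, punctuation, digits and Latin letters all fall outside the Cyrillic block, so they are all removed.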

But if we try to create our own character definition in a subroutine,
things stop working as expected.

#!/usr/local/perl
use utf8;

sub NotInRussian{
    return <<'END';
!utf8::Cyrillic
!utf8::Punctuation
!utf8::Mark
!utf8::Number
END
}
...
s/\p{NotInRussian}//g

With this we get only digits, with no spaces, which makes no sense to us.
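
For comparison, perlunicode says that user-defined property subroutines must have names beginning with "In" or "Is", and that a line prefixed with "!" adds the complement of a property to the set, while a line prefixed with "-" removes a property from it. A sketch along those lines might look like the following. The name InNotRussian, the sample line and the extra Space line are assumptions made for illustration, and this has not been checked against perl 5.8.6.

```perl
use strict;
use warnings;
use utf8;                                # the source below contains UTF-8 literals

binmode(STDOUT, ':encoding(UTF-8)');     # stands in for -CS on the command line

# Hypothetical property name; it begins with "In" as perlunicode requires.
sub InNotRussian {
    # Start from everything that is not Cyrillic, then take away the
    # classes that should be kept in the text.
    return <<'END';
!utf8::Cyrillic
-utf8::Punctuation
-utf8::Mark
-utf8::Number
-utf8::Space
END
}

# Hypothetical sample line, not taken from the corpus.
my $text = "Привет, world! Это тест 42.";

# Delete only the characters that the property above matches.
(my $cleaned = $text) =~ s/\p{InNotRussian}//g;

print "$cleaned\n";                      # prints "Привет, ! Это тест 42."
```

Stacking several "!" lines, as in the original definition, produces the union of the complements, which covers almost every character. That may be why so little survives the substitution.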

In both cases, we're calling the Perl script from bash with -CS -p and
using perl 5.8.6 on Mac OS X. There are no error messages, only
unexpected results. Any help would be greatly appreciated.

All best,
Toma
