On Mon, Mar 26, 2012 at 12:57 PM, Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 <daxim(_at_)cpan(_dot_)org>
wrote:
Let the regex engine help you advance the character counter.
$ cat langs
ΕλληνικάEnglish한국어日本語Русскийไทย
----
$ cat langs.pl
use 5.010;
use strictures;
use Unicode::UCD qw(charinfo);
sub script {
    return charinfo(ord substr($_[0], 0, 1))->{script}
};
# necessary because pos() magic is tracked on the scalar.
my $copy = $_;
while (/(\X)/g) {
    my $script = script $1;
    my ($part) = $copy =~ /(\p{$script}+)/;
    say $part;
    pos($_) = pos($_) + length($part);
}
Thanks a lot!
Here is the first version of my tokenizer based on this idea:
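use Unicode::UCD qw(charinfo);
use Lingua::ZH::MMSEG;

sub tokenize {
    my $text = shift;
    my @tokens;
    while ( $text =~ /(\X)/g ) {
        my $part   = $1;
        my $script = charinfo( ord $part )->{script};
        # Continue from pos($text): consume the rest of the run in this script.
        $text =~ /\G(\p{$script}*)/g;
        # Whitespace, punctuation and digits all have the 'Common' script.
        next if $script eq 'Common';
        $part .= $1;
        if ( $script eq 'Han' ) {
            # Chinese is written without spaces, so let MMSEG find the words.
            push @tokens, mmseg($part);
        }
        else {
            push @tokens, $part;
        }
    }
    return @tokens;
}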
And the surprise: this works even without any further splitting, because
spaces and punctuation all have the 'Common' script, so they are not
matched by \p{Latin} and end each run on their own.
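
To make the 'Common' point easy to check without installing
Lingua::ZH::MMSEG, here is a minimal sketch of just the script-run
splitting. The helper names script_of and script_runs are made up for
this illustration, Han runs are left unsegmented, and it assumes that
charinfo() returns undef for unassigned code points, which it skips.

```perl
use strict;
use warnings;
use utf8;
use Unicode::UCD qw(charinfo);

# Hypothetical helper (not from the code above): Unicode script name of the
# first code point of a string, or undef if charinfo() knows nothing about it.
sub script_of {
    my ($str) = @_;
    my $info = charinfo( ord $str );
    return $info ? $info->{script} : undef;
}

# Hypothetical helper: split text into runs of a single script and drop
# the 'Common' runs (spaces, ASCII punctuation, digits).
sub script_runs {
    my ($text) = @_;
    my @runs;
    while ( $text =~ /\G(\X)/gc ) {
        my $part   = $1;
        my $script = script_of($part);
        next unless defined $script;
        # \G anchors at the current pos(); '*' lets a one-character run succeed.
        $text =~ /\G(\p{$script}*)/gc;
        next if $script eq 'Common';
        push @runs, $part . $1;
    }
    return @runs;
}

binmode STDOUT, ':encoding(UTF-8)';
# Show which script a space, a full stop and a Latin letter belong to.
printf "%-3s => %s\n", "'$_'", script_of($_) for ' ', '.', 'a';
# Split a mixed-script string; separators are dropped along the way.
print join( ' | ', script_runs('English Русский, ไทย.') ), "\n";
```

The space and the full stop both report 'Common', so they never extend a
Latin, Cyrillic or Thai run, and the splitter needs no explicit
whitespace handling.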
--
Zbigniew Lukasiak
http://brudnopis.blogspot.com/
http://perlalchemy.blogspot.com/