Re: Word boundaries

On Mon, Mar 26, 2012 at 12:57 PM, Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 <daxim(_at_)cpan(_dot_)org> 
wrote:

Let the regex engine help you advance the character counter.

   $ cat langs
   ΕλληνικάEnglish한국어日本語Русскийไทย

----

   $ cat langs.pl
   use 5.010;
   use strictures;
   use Unicode::UCD qw(charinfo);

   sub script {
       return charinfo(ord substr($_[0], 0, 1))->{script}
   };

   # necessary because pos() magic is tracked on the scalar.
   my $copy = $_;
   while (/(\X)/g) {
       my $script = script $1;
       my ($part) = $copy =~ /(\p{$script}+)/;
       say $part;
       pos($_) = pos($_) + length($part);
   }


Thanks a lot!

Here is the first version of my tokenizer based on this idea:


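use Unicode::UCD qw(charinfo);
use Lingua::ZH::MMSEG;

sub tokenize {
    my $text = shift;
    my @tokens;
    # Take one grapheme, then let the regex engine consume the rest of
    # the run in the same script; both matches share pos($text).
    while ( $text =~ /(\X)/g ) {
        my $part = $1;
        my $script = charinfo( ord $part )->{script};
        # The * always matches at pos($text), possibly with zero length.
        $text =~ /(\p{$script}*)/g;
        next if $script eq 'Common';
        $part .= $1;
        if ( $script eq 'Han' ) {
            # Han runs have no spaces, so hand them to MMSEG.
            push @tokens, mmseg( $part );
        }
        else {
            push @tokens, $part;
        }
    }
    return @tokens;
}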

And the surprise: this works even without further splitting, because
spaces and punctuation all get the 'Common' script and so are not
matched by \p{Latin}.
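
For anyone who wants to try this without installing Lingua::ZH::MMSEG,
here is a rough core-only sketch of the same loop. script_of and
script_runs are names I made up for illustration; it returns the script
next to each run and leaves Han runs unsegmented. I wrote
\p{Script=...} so the match uses the same Script property that charinfo
reports; I believe the bare \p{Latin} form can mean Script_Extensions on
newer perls, but treat that as an assumption.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper: Script property of the first code point of $char,
# or undef when charinfo() has no entry for it.
sub script_of {
    my ($char) = @_;
    my $info = charinfo( ord $char ) or return;
    return $info->{script};
}

# Hypothetical helper: split $text into [script, run] pairs and drop the
# 'Common' runs (spaces, punctuation, digits).
sub script_runs {
    my ($text) = @_;
    my @runs;
    while ( $text =~ /\G(\X)/gc ) {
        my $part   = $1;
        my $script = script_of($part) or next;    # skip unassigned code points
        # \G continues exactly where the previous match stopped.
        $text =~ /\G(\p{Script=$script}*)/gc;
        $part .= $1;
        push @runs, [ $script, $part ] unless $script eq 'Common';
    }
    return @runs;
}

binmode STDOUT, ':encoding(UTF-8)';

# The escapes spell "Hello, мир! 日本語".
my $sample = "Hello, \x{43c}\x{438}\x{440}! \x{65e5}\x{672c}\x{8a9e}";
for my $run ( script_runs($sample) ) {
    print "$run->[0]: $run->[1]\n";
}
```

The comma, the exclamation mark and the spaces are all 'Common', so
they end each run and are then thrown away, which leaves one Latin, one
Cyrillic and one Han run.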

-- 
Zbigniew Lukasiak
http://brudnopis.blogspot.com/
http://perlalchemy.blogspot.com/