For people who wish to process Chinese text (Traditional, but also
Simplified via Encode::HanConvert), I have just uploaded
Lingua::ZH::Toke to CPAN.
That module is a 'use utf8;'-friendly frontend to my Lingua::ZH::TaBE
module; it lets you manipulate linguistic objects as shown below
(in Big5):

    use Lingua::ZH::Toke; # add 'utf8' to use Unicode strings

    # Create Lingua::ZH::Toke::Sentence object (->Sentence also works)
    my $token = Lingua::ZH::Toke->new( '那人卻在/燈火闌珊處/益發意興闌珊' );

    # Easy tokenization via array dereferencing
    print $token->[0]   # Fragment      - 那人卻在
                ->[2]   # Phrase        - 卻在
                ->[0]   # Character     - 卻
                ->[0]   # Pronunciation - ㄑㄩㄝˋ
                ->[2];  # Phonetic      - ㄝ

    # Magic histogram via hash dereferencing
    print $token->{'那人卻在'}; # 1 - One such fragment there
    print $token->{'意興闌珊'}; # 1 - One such phrase there
    print $token->{'發意興闌'}; # undef - That's not a phrase
    print $token->{'珊'}; # 2 - Two such characters there
    print $token->{'ㄧˋ'}; # 2 - Two such pronunciations: 益意
    print $token->{'ㄨ'}; # 3 - Three such phonetics: 那火處

    # Iteration over fragments
    while (my $fragment = <$token>) {
        # Iteration over phrases (iterate the fragment, not the sentence)
        while (my $phrase = <$fragment>) {
            # ...
        }
    }

The 'phonetic' symbols are expressed in BoPoMoFo notation.
There are also various utility methods (complex segmentation, etc.);
see Lingua::ZH::TaBE for details.
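
For readers curious how one object can respond to ->[...], ->{...}
and <...> at the same time: this kind of interface can be built with
Perl's core 'overload' pragma. The sketch below is only an
illustration of that technique, not the module's actual code; the
Demo::Tokens class and its 'items', 'count' and 'pos' fields are
hypothetical names made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

package Demo::Tokens;    # hypothetical class, for illustration only

# The object is a blessed reference to a scalar that holds the real
# state hash. Because '${}' is not overloaded, ${ $self } still reaches
# that state even though '@{}' and '%{}' are redirected.
use overload
    '@{}' => sub { ${ $_[0] }->{items} },    # $obj->[N]
    '%{}' => sub { ${ $_[0] }->{count} },    # $obj->{KEY}
    '<>'  => sub {                           # <$obj>
        my $state = ${ $_[0] };
        if ( $state->{pos} < @{ $state->{items} } ) {
            return $state->{items}[ $state->{pos}++ ];
        }
        $state->{pos} = 0;    # rewind so the object can be iterated again
        return undef;         # signals the end of the iteration
    };

sub new {
    my ( $class, @items ) = @_;
    my %count;
    $count{$_}++ for @items;    # histogram of the items
    my $state = { items => [@items], count => \%count, pos => 0 };
    return bless \$state, $class;
}

package main;

my $tokens = Demo::Tokens->new(qw(a b a c));

print $tokens->[1], "\n";    # array dereference: prints "b"
print $tokens->{a}, "\n";    # hash dereference: prints "2"

# Iterator: prints a, b, a, c on separate lines
while ( defined( my $item = <$tokens> ) ) {
    print "$item\n";
}
```

Keeping the state behind a scalar reference is one way to avoid the
infinite recursion that a blessed hash with an overloaded '%{}' would
otherwise run into when reading its own fields.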
Comments welcome. :-)
Thanks,
/Autrijus/