perl-i18n

[FYI] Lingua::ZH::Toke (and ::TaBE)

2003-01-19 13:51:54
For people who wish to process Chinese text (Traditional, but
also Simplified via Encode::HanConvert), I have just uploaded
Lingua::ZH::Toke to CPAN.

That module is a 'use utf8;'-friendly frontend to my Lingua::ZH::TaBE
module; it lets you manipulate linguistic objects as shown below
(in Big5):

    use Lingua::ZH::Toke;       # add 'utf8' to use Unicode strings

    # Create Lingua::ZH::Toke::Sentence object (->Sentence also works)
    my $token = Lingua::ZH::Toke->new( '那人卻在/燈火闌珊處/益發意興闌珊' );

    # Easy tokenization via array dereferencing
    print $token->[0]           # Fragment       - 那人卻在
                ->[2]           # Phrase         - 卻在
                ->[0]           # Character      - 卻
                ->[0]           # Pronunciation  - ㄑㄩㄝˋ
                ->[2];          # Phonetic       - ㄝ

    # Magic histogram via hash dereferencing
    print $token->{'那人卻在'};     # 1 - One such fragment there
    print $token->{'意興闌珊'};     # 1 - One such phrase there
    print $token->{'發意興闌'};     # undef - That's not a phrase
    print $token->{'珊'};        # 2 - Two such characters there
    print $token->{'ㄧˋ'};       # 2 - Two such pronunciations: 益意
    print $token->{'ㄨ'};        # 3 - Three such phonetics: 那火處

    # Iteration over fragments
    while (my $fragment = <$token>) {
        # Iteration over phrases
        while (my $phrase = <$fragment>) {
            # ...
        }
    }

The 'phonetic' symbols are expressed in BoPoMoFo notation.
There are also various utility methods (complex segmentation, etc.);
see Lingua::ZH::TaBE for details.
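
For readers wondering how a single object can answer to ->[...],
->{...} and <...> at once: Perl's core 'overload' pragma allows it.
What follows is only a minimal sketch of that general technique, not
the module's actual code. The Sketch::Tokens class is hypothetical, it
splits on '/' only (no real segmentation), and it uses ASCII input to
stay encoding-neutral.

```perl
package Sketch::Tokens;   # hypothetical class, NOT part of Lingua::ZH::Toke
use strict;
use warnings;

# The object is a blessed scalar ref holding a plain hash ref, so the
# overloaded @{} and %{} handlers can reach the state without recursing.
use overload
    '@{}' => sub { ${ $_[0] }->{parts} },   # $obj->[N]  : ordered parts
    '%{}' => sub { ${ $_[0] }->{count} },   # $obj->{KEY}: histogram lookup
    '<>'  => sub {                          # <$obj>     : next part
        my $state = ${ $_[0] };
        if ($state->{pos} >= @{ $state->{parts} }) {
            $state->{pos} = 0;              # rewind so iteration can restart
            return;
        }
        return $state->{parts}[ $state->{pos}++ ];
    };

sub new {
    my ($class, $text) = @_;
    my @parts = split m{/}, $text;          # assumption: '/' separates parts
    my %count;
    $count{$_}++ for @parts;
    my $state = { parts => \@parts, count => \%count, pos => 0 };
    return bless \$state, $class;
}

package main;

my $tokens = Sketch::Tokens->new('ab/cd/ab');
print $tokens->[1], "\n";                   # cd
print $tokens->{ab}, "\n";                  # 2
while (my $part = <$tokens>) {
    print "part: $part\n";
}
```

Keeping the state behind a scalar reference is one common way to
avoid the handlers calling themselves; other designs (closures,
temporarily disabling overloading) would work as well.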

Comments welcome. :-)

Thanks,
/Autrijus/
