perl-unicode

Re: filtering out non-Japanese

2004-12-15 05:30:06
Thanks!

A very silly follow-up question: where can I find the hexadecimal ranges for hiragana, katakana and kanji?

M
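
One way to look the ranges up from Perl itself is the core Unicode::UCD module, whose charblock() function returns the code point range of a named block. The sketch below is only an illustration: the helper name block_range is hypothetical, and the block names are assumed to match those in the Unicode Blocks.txt file. The commonly cited blocks are Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF) and CJK Unified Ideographs (starting at U+4E00), though kanji also occur in other CJK blocks.

```perl
use strict;
use warnings;
use Unicode::UCD 'charblock';

# Hypothetical helper: return "U+XXXX..U+YYYY" for a Unicode block name,
# or undef if the installed Unicode tables do not know that name.
sub block_range {
    my ($name) = @_;
    my $ranges = charblock($name) or return;
    # charblock() returns a list of ranges; a block has exactly one.
    my ($start, $end) = @{ $ranges->[0] };
    return sprintf 'U+%04X..U+%04X', $start, $end;
}

for my $name ('Hiragana', 'Katakana', 'CJK Unified Ideographs') {
    my $range = block_range($name);
    print "$name: ", (defined $range ? $range : 'not found'), "\n";
}
```

The printed values can be pasted into a character class as \x{...} escapes.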


On Wednesday, Dec 15, 2004, at 11:24 Europe/Rome, John Delacour wrote:

At 10:22 am +0100 15/12/04, Marco Baroni wrote:

I have a long text ostensibly in utf-8, and I would like to get rid of all the lines that contain anything BUT kanji, katakana or hiragana (thus, throwing away Latin, but also digits, punctuation, etc.)


There's probably a better way to do it, but here I print only characters in the hiragana and katakana range or the 0-9 range:


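use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';   # write UTF-8 (the old "use encoding" pragma is deprecated)
my $line = "123_.latin,\x{30AA}fran\x{00E7}ais";
for (split //, $line) {
        # keep hiragana and katakana (U+3041..U+30FF) and ASCII digits
        m~[\x{3041}-\x{30ff}]|[0-9]~ and print;
}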

JD
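
For the original goal of keeping only lines made up entirely of kanji, hiragana and katakana, a whole-line match against Unicode script properties may be simpler than testing characters one at a time. The following is a minimal sketch, not a tested solution for any particular corpus: the helper name is_pure_japanese and the sample strings are made up for illustration, and the input is assumed to be already decoded from UTF-8 into Perl character strings.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the decoded line consists solely of
# kanji (Han), hiragana and katakana characters. Empty lines fail.
sub is_pure_japanese {
    my ($line) = @_;
    return $line =~ /\A[\p{Han}\p{Hiragana}\p{Katakana}]+\z/ ? 1 : 0;
}

# Made-up sample lines: kanji only, katakana only, Latin plus hiragana,
# hiragana plus digits.
my @sample = (
    "\x{65E5}\x{672C}\x{8A9E}",
    "\x{30AB}\x{30BF}\x{30AB}\x{30CA}",
    "abc\x{3042}",
    "\x{3042}123",
);

# Keep only the lines that pass the whole-line test.
my @kept = grep { is_pure_japanese($_) } @sample;

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @kept;
```

When reading a real file, the same test can be applied to each chomped line from a handle opened with the ':encoding(UTF-8)' layer. Characters whose script is shared between writing systems, such as the prolonged sound mark U+30FC, may or may not match these properties depending on the Perl version, so they may need to be added to the character class explicitly.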

