perl-unicode

filtering out non-Japanese: ERRATA CORRIGE!

2004-12-15 02:30:07
Oops: the test should have been negated. What I would actually like to do is:

perl -ne 'if (!/[^\p{Hiragana}\p{Katakana}\p{Kanji}]/){print}' webcorpus.tok > webcorpus.clean.tok

Sorry for the imprecision...

Marco


On Wednesday, Dec 15, 2004, at 10:22 Europe/Rome, Marco Baroni wrote:

Dear all,

I have a long text, ostensibly in UTF-8, and I would like to get rid of all the lines that contain anything BUT kanji, katakana or hiragana (thus throwing away Latin, but also digits, punctuation, etc.).

In short, I would like to do something like:

perl -ne 'if (/[^\p{Hiragana}\p{Katakana}\p{Kanji}]/){print}' webcorpus.tok > webcorpus.clean.tok

Is it possible to do something like that?

Thanks a lot!

Marco

