perl-unicode
filtering out non-Japanese
2004-12-15 02:30:06

Dear all,

I have a long text, ostensibly in UTF-8, and I would like to get rid of all the lines that contain anything but kanji, katakana or hiragana (thus throwing away Latin, but also digits, punctuation, etc.). In short, I would like to do something like:

    perl -ne 'if (/[^\p{Hiragana}\p{Katakana}\p{Kanji}]/){print}' webcorpus.tok > webcorpus.clean.tok

Is it possible to do something like that?

Thanks a lot!
Marco