Oops: what I would like to do is actually:
perl -ne 'if (!/[^\p{Hiragana}\p{Katakana}\p{Kanji}]/){print}' \
    webcorpus.tok > webcorpus.clean.tok
Sorry for the imprecision...
Marco
On Wednesday, Dec 15, 2004, at 10:22 Europe/Rome, Marco Baroni wrote:
Dear all,
I have a long text, ostensibly in UTF-8, and I would like to get rid of
all the lines that contain anything BUT kanji, katakana or hiragana
(thus, throwing away Latin, but also digits, punctuation, etc.)
In short, I would like to do something like:
perl -ne 'if (/[^\p{Hiragana}\p{Katakana}\p{Kanji}]/){print}' \
    webcorpus.tok > webcorpus.clean.tok
Is it possible to do something like that?
Thanks a lot!
Marco