Thanks to all those who replied to my inquiry.
In the end, putting together various suggestions, I did the following:
perl -CSD -ne 'if (/\p{N}/){next}; if (/\p{P}/){next}; if (/\p{S}/){next}; if (/\p{Latin}/){next}; if (/\p{Greek}/){next}; if (/[\x{FF65}-\x{FF9F}]/){next}; if (/\p{Ideographic}|\p{Hiragana}|\p{Katakana}/){print}' webcorpus.tok > webcorpus.cleaned.tok
The reason why I did not do something simpler like:
perl -CSD -ne 'if (/[\x{FF65}-\x{FF9F}]/){next}; if (/^(?:\p{Hiragana}|\p{Katakana}|\p{Ideographic})+$/){print}' webcorpus.tok
is that this second method throws away lines containing some characters
(such as the "repetition" kanji, 々) that I did not want to discard.
Still, I'm a bit surprised that it is not possible to do something like:
perl -CSD -ne 'if (/^(?:\p{Hiragana}|\p{Katakana}|\p{Kanji})+$/){print}' webcorpus.tok
or even better:
perl -CSD -ne 'if (/^\p{Japanese}+$/){print}' webcorpus.tok
Shouldn't something like that be possible?
Thanks again.
Regards,
Marco
---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni