perl-unicode

filtering out non-Japanese: my solution

2004-12-15 16:30:08
Thanks to all those who replied to my inquiry.

In the end, putting together various suggestions, I did the following:

perl -CSD -ne 'if (/\p{N}/){next}; if (/\p{P}/){next}; if (/\p{S}/){next}; if (/\p{Latin}/){next}; if (/\p{Greek}/){next}; if (/[\x{FF65}-\x{FF9F}]/){next}; if (/\p{Ideographic}|\p{Hiragana}|\p{Katakana}/){print}' webcorpus.tok > webcorpus.cleaned.tok

The reason why I did not do something simpler like:

perl -CSD -ne 'if (/[\x{FF65}-\x{FF9F}]/){next}; if (/^(?:\p{Hiragana}|\p{Katakana}|\p{Ideographic})+$/){print}' webcorpus.tok

is that the second method throws away every line containing a character outside those three classes, including some (such as the "repetition" kanji) that I did not want to discard.

Still, I'm a bit surprised that it is not possible to do something like:

perl -CSD -ne 'if (/^(?:\p{Hiragana}|\p{Katakana}|\p{Kanji})+$/){print}' webcorpus.tok

or even better:

perl -CSD -ne 'if (/^\p{Japanese}+$/){print}' webcorpus.tok

Shouldn't something like that be possible?

Thanks again.

Regards,

Marco


---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni
