filtering out non-Japanese: my solution

Thanks to all those who replied to my inquiry.

In the end, putting together various suggestions, I did the following:

perl -CSD -ne 'if (/\p{N}/){next}; if (/\p{P}/){next}; if (/\p{S}/){next}; if (/\p{Latin}/){next}; if (/\p{Greek}/){next}; if (/[\x{FF65}-\x{FF9F}]/){next}; if (/\p{Ideographic}|\p{Hiragana}|\p{Katakana}/){print}' webcorpus.tok >webcorpus.cleaned.tok


The reason why I did not do something simpler like:

perl -CSD -ne 'if (/[\x{FF65}-\x{FF9F}]/){next}; if (/^(?:\p{Hiragana}|\p{Katakana}|\p{Ideographic})+$/){print}' webcorpus.tok

is that with this second method I'm throwing away some characters (such as the "repetition" kanji) that I did not want to discard.


Still, I'm a bit surprised that it is not possible to do something like:

perl -CSD -ne 'if(/^(?:\p{Hiragana}|\p{Katakana}|\p{Kanji})+$/){print}' webcorpus.tok


or even better:

perl -CSD -ne 'if (/^\p{Japanese}+$/){print}' webcorpus.tok

Shouldn't something like that be possible?

Thanks again.

Regards,

Marco


---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni
