Re: filtering out non-Japanese

Marco Baroni wrote:

I have a long text ostensibly in utf-8, and I would like to get rid of all the lines that contain anything BUT kanji, katakana or hiragana (thus, throwing away Latin, but also digits, punctuation, etc.)
In short, I would like to do something like:
perl -ne 'if (/[^\p{Hiragana}\p{Katakana}\p{Kanji}]/){print}' webcorpus.tok > webcorpus.clean.tok
Is it possible to do something like that?



The current implementation (at least in v5.8.5; I don't know about the
status in v5.8.6, as I have not had time to upgrade yet) has limitations
on nesting character classes inside "[...]" character classes.
From "perldoc perlunicode":

 ·   Character classes in regular expressions match characters
     instead of bytes and match against the character
     properties specified in the Unicode properties
     database.  "\w" can be used to match a Japanese
     ideograph, for instance.

     (However, and as a limitation of the current
     implementation, using "\w" or "\W" inside a "[...]"
     character class will still match with byte semantics.)


That means, in v5.8.5 this does not work:

perl -CSD -ne 'print if /^[\p{Hiragana}\p{Katakana}\p{Kanji}]+$/' f >f-clean.tok


but replacing the [...] class with a group (?:...) does work:

perl -CSD -ne 'print if /^(?:\p{Hiragana}|\p{Katakana}|\p{Kanji})+$/' f> f-clean.tok



--
Paul Bijnens, Xplanation                            Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/          email:  Paul(_dot_)Bijnens(_at_)xplanation(_dot_)com