perl-unicode

Re: filtering out non-Japanese

2004-12-15 05:30:07
Marco Baroni wrote:

I have a long text ostensibly in utf-8, and I would like to get rid of all the lines that contain anything BUT kanji, katakana or hiragana (thus, throwing away Latin, but also digits, punctuation, etc.)

In short, I would like to do something like:

perl -ne 'if (/[^\p{Hiragana}\p{Katakana}\p{Kanji}]/){print}' webcorpus.tok > webcorpus.clean.tok

Is it possible to do something like that?


The current implementation (at least in v5.8.5; I don't know about the
status in v5.8.6, as I have not had time to upgrade yet) has limitations
on using character classes inside "[...]" character classes.
From "perldoc perlunicode":

 ·   Character classes in regular expressions match characters
     instead of bytes and match against the character properties
     specified in the Unicode properties database.  "\w" can be
     used to match a Japanese ideograph, for instance.

     (However, and as a limitation of the current implementation,
     using "\w" or "\W" inside a "[...]" character class will
     still match with byte semantics.)


That means, in v5.8.5 this does not work:

perl -CSD -ne 'print if /^[\p{Hiragana}\p{Katakana}\p{Kanji}]+$/' f > f-clean.tok

but replacing the [...] class with a group (?:...) does work:

perl -CSD -ne 'print if /^(?:\p{Hiragana}|\p{Katakana}|\p{Kanji})+$/' f > f-clean.tok
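For anyone who wants to try this on a small file first, here is a minimal sketch of the same filter. Two assumptions on my side: the file name sample.tok is made up for the example, and I have written \p{Han} where the commands above use \p{Kanji}, because Han is the Unicode script name that covers kanji and I am not aware of Perl recognizing a property called Kanji. The bracketed form is included only for comparison; I would expect it to give the same result on recent Perl releases, but I have not checked every version.

```shell
# Build a tiny UTF-8 test file (sample.tok is a hypothetical name).
printf '%s\n' 'これは日本語' 'カタカナ' 'abc' '日本語123' '日本語 です' > sample.tok

# Keep only lines that consist entirely of Han, Hiragana or Katakana.
# -CSD: treat the standard streams and the files read by -n as UTF-8.
# The ^...$ anchors make sure a single Latin letter, digit or space
# anywhere in the line is enough to drop it.
perl -CSD -ne 'print if /^(?:\p{Han}|\p{Hiragana}|\p{Katakana})+$/' \
    sample.tok > sample.clean.tok

# Same test written with a bracketed character class, for comparison.
perl -CSD -ne 'print if /^[\p{Han}\p{Hiragana}\p{Katakana}]+$/' \
    sample.tok > sample.clean2.tok

cat sample.clean.tok
# prints:
# これは日本語
# カタカナ
```

The line with digits, the line with an ASCII space and the pure Latin line are all thrown away, which is what the original question asked for. Running cmp on sample.clean.tok and sample.clean2.tok is a quick way to see whether the bracketed form behaves the same on your Perl.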


--
Paul Bijnens, Xplanation                            Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/          email:  Paul(_dot_)Bijnens(_at_)xplanation(_dot_)com

