Marco Baroni wrote:
I have a long text ostensibly in utf-8, and I would like to get rid of
all the lines that contain anything BUT kanji, katakana or hiragana
(thus, throwing away Latin, but also digits, punctuation, etc.)
In short, I would like to do something like:
perl -ne 'if (/[^\p{Hiragana}\p{Katakana}\p{Kanji}]/){print}' webcorpus.tok > webcorpus.clean.tok
Is it possible to do something like that?
The current implementation (at least in v5.8.5; I have not had time to
upgrade yet, so I don't know the status in v5.8.6) has limitations
on nesting character classes inside "[...]" character classes.
From "perldoc perlunicode":
· Character classes in regular expressions match characters
instead of bytes and match against the character
properties specified in the Unicode properties
database. "\w" can be used to match a Japanese ideograph,
for instance.
(However, and as a limitation of the current implementation,
using "\w" or "\W" inside a "[...]" character
class will still match with byte semantics.)
That means, in v5.8.5 this does not work:
perl -CSD -ne 'print if /^[\p{Hiragana}\p{Katakana}\p{Kanji}]+$/' f > f-clean.tok
but replacing the [...] class with a group (?:...) does work:
perl -CSD -ne 'print if /^(?:\p{Hiragana}|\p{Katakana}|\p{Kanji})+$/' f > f-clean.tok
--
Paul Bijnens, Xplanation Tel +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM Fax +32 16 397.512
http://www.xplanation.com/ email:
Paul(_dot_)Bijnens(_at_)xplanation(_dot_)com