filtering out non-Japanese

perl-unicode



2004-12-15 02:30:06

from [Marco Baroni]


Dear all,

I have a long text, ostensibly in UTF-8, and I would like to get rid of all the lines that contain anything BUT kanji, katakana or hiragana (thus throwing away Latin, but also digits, punctuation, etc.).


In short, I would like to do something like:

perl -ne 'if (/[^\p{Hiragana}\p{Katakana}\p{Kanji}]/){print}' webcorpus.tok > webcorpus.clean.tok


Is it possible to do something like that?

Thanks a lot!

Marco
