perl-unicode

Re: Invalid Unicode characters

2004-01-02 17:30:06
At 11:31 am +0100 16/9/03, z(_dot_)xiao(_at_)lancaster(_dot_)ac(_dot_)uk wrote:

I am running Perl 5.8. and trying to filter out some invalid Unicode characters from Unicode texts in some South Asian languages. There are 28 such characters in my data (all control characters):

0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1B, 0x1C, 0x1D, 0x1F, 0x1e, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xB, 0xC, 0xF, 0xFFFF, 0xe

The data is encoded as UTF-16 and I want to keep it that way once the invalid characters are removed. Is there an easy way to do this in Perl while keeping the textual quality intact? Any advice is welcome. Thanks.

I'm not quite sure what you're saying, but suppose I write a file of four Chinese characters (zi4li4geng1sheng1) with control bytes such as \x01 and \x10 strewn in among the Chinese, and then strip them out again while reading the file back:



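use strict;
use warnings;

my $f = "/tmp/zili.txt";
# Write a big-endian BOM and four characters; stray bytes 01 and 10 sit between them.
open my $out, '>:raw', $f or die "Cannot write $f: $!";
print $out pack "H*", "FEFF0181EA10529B66F4751F";
close $out;
# Read the raw bytes back and delete the stray control bytes.
open my $in, '<:raw', $f or die "Cannot read $f: $!";
binmode STDOUT;
while (<$in>) {
  s~[\x01\x02\x10]~~g;
  print;
}
close $in;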
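
Note that the substitution above works on raw bytes, so it only suits a file like this test case, where the control bytes are single bytes wedged between the two-byte characters. If the unwanted characters are instead stored as full 16-bit code units (U+0001 as the bytes 00 01, say), deleting every 01 byte would also damage legitimate characters that happen to contain that byte. A minimal sketch of a code-unit filter for that case follows. The helper name strip_invalid_utf16 and the %invalid table are made up for illustration, the byte order is assumed to be big-endian unless a little-endian BOM is present, and the list of code points is simply the one from the original question.

```perl
use strict;
use warnings;

# Code points listed in the original question (28 in total).
my %invalid = map { $_ => 1 }
    (0x01 .. 0x08, 0x0B, 0x0C, 0x0E .. 0x19, 0x1B .. 0x1F, 0xFFFF);

# Hypothetical helper: takes a string of UTF-16 bytes and returns the same
# bytes minus every 16-bit code unit found in %invalid.
sub strip_invalid_utf16 {
    my ($bytes) = @_;
    # Assumption: big-endian unless the data starts with a little-endian BOM (FF FE).
    my $template = $bytes =~ /^\xFF\xFE/ ? 'v*' : 'n*';
    my @units = unpack $template, $bytes;
    return pack $template, grep { !$invalid{$_} } @units;
}

# The four characters from the example, with U+0001 and U+0010 as full code units.
my $dirty = pack 'H*', 'FEFF000181EA0010529B66F4751F';
my $clean = strip_invalid_utf16($dirty);
print unpack('H*', $clean), "\n";   # prints feff81ea529b66f4751f
```

Because every code point in the list lies in the Basic Multilingual Plane and outside the surrogate range, dropping those code units leaves any surrogate pairs in the text untouched, and the result is still UTF-16 in the same byte order as the input.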
