Re: Invalid Unicode characters

At 11:31 am +0100 16/9/03, z(_dot_)xiao(_at_)lancaster(_dot_)ac(_dot_)uk wrote:

I am running Perl 5.8 and trying to filter out some invalid Unicode characters from Unicode texts in some South Asian languages. There are 28 such characters in my data (all control characters):
0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1B, 0x1C, 0x1D, 0x1F, 0x1e, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xB, 0xC, 0xF, 0xFFFF, 0xe
The data is encoded as UTF-16 and I want to keep it that way once the invalid characters are removed. Is there an easy way to do this with Perl while keeping the textual quality intact? Any advice is welcome. Thanks.

I'm not quite sure what you're saying, but suppose I write a file of four Chinese characters (zi4li4geng1sheng1) with control characters \x01, \x02, etc. strewn in among the Chinese:

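my $f = "/tmp/zili.txt";
# Write a BOM, then the four characters with stray bytes \x01 and \x10 mixed in
open my $out, ">", $f or die "Cannot write $f: $!";
binmode $out;
print $out pack "H*", "FEFF0181EA10529B66F4751F";
close $out;
# Read the file back as raw bytes and delete the stray control bytes
open my $in, "<", $f or die "Cannot read $f: $!";
binmode $in;
for (<$in>) {
  s~[\x01\x02\x10]~~g;
  print;
}
close $in;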
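
If the unwanted characters are stored as full 16-bit code units, as they would be in well-formed UTF-16 (unlike the single stray bytes in my test file), deleting bytes would also damage valid characters that happen to contain those byte values. A rough sketch of the same filtering done per code unit follows. It assumes the byte order can be taken from the BOM, falling back to big-endian when there is none, and strip_bad_utf16 is just a name I made up for illustration. The set is the 28 values from the original question, written as ranges.

```perl
use strict;
use warnings;

# The 28 code points listed in the question, written as ranges.
my %bad = map { $_ => 1 }
    0x01 .. 0x08, 0x0B, 0x0C, 0x0E .. 0x19, 0x1B .. 0x1F, 0xFFFF;

# Hypothetical helper (not from any module): takes raw UTF-16 bytes and
# returns raw UTF-16 bytes in the same byte order, minus the bad code units.
sub strip_bad_utf16 {
    my ($bytes) = @_;
    # A leading FF FE means little-endian ("v"); otherwise assume
    # big-endian ("n"), which is an assumption when no BOM is present.
    my $tmpl = substr($bytes, 0, 2) eq "\xFF\xFE" ? "v*" : "n*";
    my @units = unpack $tmpl, $bytes;
    # Drop every 16-bit unit whose value is in the bad set; the BOM and
    # all other units pass through unchanged.
    return pack $tmpl, grep { !$bad{$_} } @units;
}
```

None of the listed values falls in the surrogate range D800-DFFF, so dropping whole code units cannot split a surrogate pair. To use it on a file, read the contents in binary mode (binmode, as above), pass the bytes through strip_bad_utf16, and write the result back in binary mode as well.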