Re: Invalid Unicode characters

At 11:31 am +0100 16/9/03, z(_dot_)xiao(_at_)lancaster(_dot_)ac(_dot_)uk wrote:

I am running Perl 5.8. and trying to filter out some invalid Unicode characters from Unicoded texts of some South Asian languages. There are 28 such characters in my data (all control characters):
0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1B, 0x1C, 0x1D, 0x1F, 0x1e, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xB, 0xC, 0xF, 0xFFFF, 0xe
The data is coded as utf-16 and I want to keep it this way when the invalid characters are removed. Is there an easy way to do this with Perl while keeping the textual quality intact? Any advice is welcome. Thanks.

You don't tell us where these control characters have stuck themselves in, but if it's between characters and not between the two bytes of a character, then perhaps something like this would be the answer. First I write zi4li4geng1sheng1 to a file with single or multiple control characters inserted between the UCS-2 characters.

I then read the file in pairs and throw out any occurrences of the given control characters in the first position. I haven't tested it except with this simple example and variations of the same, but it works fine as far as it goes, so it may give you an idea until someone else comes up with a killer routine.



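use strict;
use warnings;
my $f = "/tmp/zili.txt";
my $fout = "/tmp/ziliclean.txt";
# Test data: BOM plus zi4li4geng1sheng1 in big-endian UCS-2, with stray
# control bytes (13, 01, 10, 02 02 01) inserted between the characters.
open F, ">$f" or die $!;
binmode F;
print F pack "H*", "13FEFF0181EA10529B66F4020201751F";
close F;
open FOUT, ">$fout" or die $!;
binmode FOUT;
sysopen F, $f, 0 or die $!;   # 0 is O_RDONLY
my $bytes = 1;   # position within the current two-byte character (1 or 2)
while (sysread F, $_, 1) {
  # Drop a listed control byte only where a first byte is expected.
  next if $bytes == 1 and /[\x01\x02\x10\x13]/;
  print;         # echo the kept byte to STDOUT
  print FOUT;
  # Toggle between first and second byte of the pair.
  $bytes = $bytes == 1 ? 2 : 1;
}
close F;
close FOUT or die $!;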
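
If instead the control characters turn out to be stored as complete 16-bit units (00 01 rather than a lone 01), which is only a guess about your data, the same idea can be applied to whole code units rather than single bytes. The following is a rough sketch under that assumption, and it further assumes big-endian byte order and an even number of bytes. The names %bad and strip_units are made up for illustration; the 28 values are the ones from your list.

```perl
use strict;
use warnings;

# The 28 code points from the question (assumed to appear as whole
# 16-bit big-endian units in the data).
my %bad = map { $_ => 1 }
  (0x01 .. 0x08, 0x0B, 0x0C, 0x0E .. 0x19, 0x1B .. 0x1F, 0xFFFF);

# Hypothetical helper: takes raw UTF-16BE bytes, returns them with the
# listed units removed. None of the listed values is a surrogate, so
# surrogate pairs pass through untouched.
sub strip_units {
  my ($raw) = @_;
  return pack "n*", grep { !$bad{$_} } unpack "n*", $raw;
}

# Same four characters as above, this time with 0001, 0010 and FFFF
# inserted as full code units.
my $in  = pack "H*", "FEFF000181EA0010529BFFFF66F4751F";
my $out = strip_units($in);
print unpack("H*", $out), "\n";
```

For little-endian data the "n*" templates would become "v*". Working on the units directly leaves everything else in the file byte-for-byte as it was, so the result is still utf-16.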

JD