perl-unicode

Re: Invalid Unicode characters

2004-01-02 17:30:07
At 11:31 am +0100 16/9/03, z(_dot_)xiao(_at_)lancaster(_dot_)ac(_dot_)uk wrote:

I am running Perl 5.8 and trying to filter out some invalid Unicode characters from Unicode texts in some South Asian languages. There are 28 such characters in my data (all control characters):

0x1, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1B, 0x1C, 0x1D, 0x1F, 0x1e, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xB, 0xC, 0xF, 0xFFFF, 0xe

The data is coded as utf-16 and I want to keep it this way when the invalid characters are removed. Is there an easy way to do this with Perl while keeping the textual quality intact? Any advice is welcome. Thanks.

You don't tell us where these control characters have stuck themselves in, but if it's between characters and not between the two bytes of a character, then perhaps something like this would be the answer. First I write zi4li4geng1sheng1 to a file, with single or multiple control characters inserted between the UCS-2 characters.

I then read the file in pairs and throw out any occurrences of the given control characters in the first position. I haven't tested it except with this simple example and variations of the same, but it works fine as far as it goes, so it may give you an idea until someone else comes up with a killer routine.


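use strict;
use warnings;

my $f    = "/tmp/zili.txt";
my $fout = "/tmp/ziliclean.txt";

# Write the test file: a BOM and four UCS-2 characters with stray control bytes.
open F, ">$f" or die $!;
binmode F;
print F pack "H*", "13FEFF0181EA10529B66F4020201751F";
close F;

open FOUT, ">$fout" or die $!;
binmode FOUT;
sysopen F, $f, 0 or die $!;   # 0 is O_RDONLY
my $bytes = 1;                # position within the current two-byte character
while (sysread F, $_, 1) {
  # Drop a listed control byte only when it sits in the first position.
  (/[\x01\x02\x10\x13]/ and $bytes == 1)
  or (print and print FOUT and $bytes += 1);
  # Both bytes of the character are written: start the next pair.
  ($bytes == 3) and $bytes = 1;
}
close FOUT;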
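
If the control characters turn out to be stored as full 16-bit code units (0x0013 rather than a lone 0x13 byte), which is what well-formed UTF-16 would contain, a different sketch might suit better. The following is only an illustration under that assumption: the helper name strip_bad_units is made up here, the byte order is taken from the BOM with big-endian assumed when there is none, and the list of code points is the one given in the question.

```perl
use strict;
use warnings;

# The 28 code points listed in the question, as a lookup table.
my %bad = map { $_ => 1 }
    0x01 .. 0x08, 0x0B, 0x0C, 0x0E .. 0x19, 0x1B .. 0x1F, 0xFFFF;

# Hypothetical helper: takes raw UTF-16 bytes, returns the cleaned bytes.
sub strip_bad_units {
    my ($raw) = @_;
    # Pick the byte order from the BOM; big-endian is assumed otherwise.
    my $fmt = substr($raw, 0, 2) eq "\xFF\xFE" ? "v*" : "n*";
    # Unpack into 16-bit code units, drop the unwanted ones, repack.
    return pack $fmt, grep { !$bad{$_} } unpack $fmt, $raw;
}

# Demo: BOM, U+0013, U+81EA, U+0001, U+529B, U+FFFF (big-endian).
my $dirty = pack "n*", 0xFEFF, 0x0013, 0x81EA, 0x0001, 0x529B, 0xFFFF;
print unpack("H*", strip_bad_units($dirty)), "\n";   # feff81ea529b
```

None of the listed values falls in the surrogate range, so dropping them unit by unit should leave surrogate pairs untouched. For a real file, the bytes would be read and written with binmode set on both handles; a trailing odd byte, if any, is ignored by unpack.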

JD
