perl-unicode

Re: Invalid Unicode characters

2003-09-16 07:30:07

z(_dot_)xiao(_at_)lancaster(_dot_)ac(_dot_)uk said:
I am running Perl 5.8 and trying to filter out some invalid Unicode
characters from Unicode texts of some South Asian languages. There
are 28 such characters in my data (all control characters):

0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xB, 0xC, 0xE, 0xF, 0x10,
0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1B, 0x1C,
0x1D, 0x1E, 0x1F, 0xFFFF

The data is coded as utf-16 and I want to keep it this way when the
invalid characters are removed. Is there an easy way to do this with
Perl while keeping the textual quality intact? Any advice is welcome.
Thanks. 

If your data are utf-16, are you actually saying that you have 28 
distinct 16-bit values scattered in your data and you want to remove 
them?  i.e.:

\x{0001} \x{0002} ... \x{0010} \x{0011} \x{0012} \x{0013} ... 

Or do you mean something like: these 28 byte values are showing up
"stranded" (unpaired with a second byte that would produce a valid
utf-16 code point)?  (Or do you mean something else besides these two 
possibilities?)
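(For concreteness, an illustration of my own, not part of the original
exchange: a code point such as U+0010 occupies a full 16-bit unit, i.e.
two octets, in UTF-16, so the two readings above really are different
conditions.)

```perl
use strict;
use warnings;
use Encode qw(encode);

# The single character U+0010 becomes two octets in UTF-16BE,
# so "the character 0x10" and "a stray byte 0x10" are not the same thing.
my $octets = encode('UTF-16BE', "\x{0010}");
printf "%d octets: %s\n", length($octets), unpack('H*', $octets);
```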

(Are you really seeing \x{ffff} in your data?  or did you mean to 
indicate \x{00ff}?)

If your data contains ASCII control characters that have been rendered
in utf-16 form, these would not be considered "invalid"; in fact, some
of these control characters are quite common and useful in text
(including "tab", "line-feed", "carriage-return").  Still, if you want
to eliminate all of them, then the "tr" function is probably your best
bet.  First, you need to "decode" your utf-16 data into perl's internal
utf8 form (see the man pages for Encode, PerlIO, Encode::PerlIO, and
PerlIO::encoding) -- here's an example using PerlIO, dumping the "fixed"
text to STDOUT (e.g. for redirection to some other file):

  use Encode;
  ...

  open( IN, "<:encoding(UTF-16)", "utf16.file" ) or die "can't open: $!";
  binmode STDOUT, ":encoding(UTF-16)";

  while (<IN>) {
     tr/\x{0001}-\x{001f}//d;
     print;
  }

(In many cases, specifying a Unicode character range like this is
ill-conceived, but here it is safe: the C0 control characters form one
contiguous, unambiguous block, so the range does what you expect.)

I'm not sure whether the above will really produce a result that you
want, since it does remove carriage-returns and line-feeds, which may
cause some word-breaks to disappear from the data (and you would see
words likethis -- that used to be two words but are now oneword).
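If keeping the line breaks matters, one option (a sketch of my own, not
something from the original question) is to delete only the 28 characters
the questioner listed, which leaves tab, line-feed, and carriage-return
untouched:

```perl
use strict;
use warnings;

# Delete only the characters from the questioner's list, i.e. the C0
# controls *except* tab (0x9), line-feed (0xA), carriage-return (0xD)
# and 0x1A, plus U+FFFF.
my $text = "two\x{000A}lines\x{0001}with\x{FFFF}noise";
$text =~ tr/\x{0001}-\x{0008}\x{000B}\x{000C}\x{000E}-\x{0019}\x{001B}-\x{001F}\x{FFFF}//d;
print $text;   # the line-feed between "two" and "lines" survives
```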

If your data contain "stranded" single bytes, you have a real problem 
on your hands.  You can use the "decode" function provided by Encode.pm 
to trap strings that contain such errors, as follows:

 use Encode;
 ...

 eval { $_ = decode( 'UTF-16', $utf16string, Encode::FB_CROAK ) };

 if ( $@ ) {
    # $utf16string contains stuff that can't be interpreted as UTF-16
    ...
 }

But what you do inside that "if" block to handle the errors is not
likely to be obvious or easy -- stray bytes in a utf-16 stream means 
there has been some form of corruption, and the question is: how do you 
figure out exactly where (and how pervasive) the corruption really is?
(Let alone how to fix it...)
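A less drastic alternative (my suggestion, not something the original
poster asked about) is Encode's FB_DEFAULT check mode, which substitutes
U+FFFD for each undecodable sequence instead of croaking, so you can at
least see where the damage sits:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Octet string with a stranded trailing byte: "AB" in UTF-16BE
# followed by one stray 0x00 octet.
my $octets = "\x00\x41\x00\x42\x00";

# FB_DEFAULT replaces anything undecodable with U+FFFD rather than
# dying, marking the position of the corruption in the output.
my $chars = decode('UTF-16BE', $octets, Encode::FB_DEFAULT);

printf "U+%04X\n", ord($_) for split //, $chars;
```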

        Dave Graff

