Re: Problems with Perl Asian encodings?

2007-05-17 02:30:12

Samuel L. Bayer wrote:
So the outcome was that there's a mode in GNU recode which will drop these illegal first bytes. So the question is: is the same thing possible in Perl Encode? The documentation for some of the FB_ variables is tempting, but pretty opaque.

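Yes, the way to do it is with Encode::FB_QUIET. With that check value, decode() stops quietly at the first byte it can't handle and leaves the unprocessed remainder in the source variable, so you can drop that byte and carry on. If $text holds the bytes you want to decode into a Perl character string, this should do the trick: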

use Encode;

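my $textcopy = $text;   # keep the original, since decode() below consumes $text
my $encoding = "gb2312";

# Decode as far as the first bad byte (or the end of the input).
my $decoded = decode($encoding, $text, Encode::FB_QUIET);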

while ($text ne "") {   # loop while we've still got bad bytes to deal with
  ### my $badbyte = substr($text, 0, 1);   # $badbyte now contains the invalid byte
  ### my $hex = sprintf("%02X", ord($badbyte));
  ### print STDERR "Invalid character \\x$hex in input - dropping.\n";
  $text = substr($text, 1);   # skip over the bad byte
  $decoded .= decode($encoding, $text, Encode::FB_QUIET);
}

print "Output: $decoded\n";

As given, the code silently drops every bad byte and prints no warnings; if you want warnings, uncomment the lines marked with ###. It depends what you want your code to do. :D
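
If it's useful, the same loop could be wrapped in a small sub, roughly as sketched below. The name decode_dropping_bad_bytes and the sample input are made up for illustration (they are not part of Encode), and I'm assuming a stray 0xFF byte counts as invalid in gb2312:

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper (not part of Encode): decode $octets from $encoding,
# dropping every byte the decoder cannot handle. Returns the decoded string
# and the number of bytes dropped.
sub decode_dropping_bad_bytes {
    my ($encoding, $octets) = @_;   # $octets is a copy, so the caller's data is untouched
    my $dropped = 0;

    # With FB_QUIET, decode() stops at the first malformed byte and leaves
    # the unprocessed remainder in $octets.
    my $decoded = decode($encoding, $octets, Encode::FB_QUIET);
    while ($octets ne "") {
        $octets = substr($octets, 1);   # drop the offending byte
        $dropped++;
        $decoded .= decode($encoding, $octets, Encode::FB_QUIET);
    }
    return ($decoded, $dropped);
}

# Sample input (assumption): "A", a stray 0xFF byte, then "B".
my ($out, $n) = decode_dropping_bad_bytes("gb2312", "A\xFFB");
print STDERR "Dropped $n byte(s).\n";
```

Because the sub works on its own copy of the input, there's no need for a separate $textcopy in the caller, and the count makes it easy to decide whether to warn.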

Hope this helps!

 - Ciaran.

