Hi,
Samuel L. Bayer wrote:
So the outcome was that there's a mode in GNU recode which will drop
these illegal first bytes. So the question is: is the same thing
possible in Perl Encode? The documentation for some of the FB_ variables
is tempting, but pretty opaque.
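Yes, you can do it with Encode::FB_QUIET. With that flag, decode()
stops at the first byte it can't handle, returns whatever it managed to
decode up to that point, and leaves the undecoded remainder in the
variable you passed in. So if $text holds the bytes you want to decode
into a Perl string, this should do the trick: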
-----
use Encode;
my $textcopy = $text; # decode() below modifies $text, so keep the original
my $encoding = "gb2312";
my $decoded = decode($encoding, $text, Encode::FB_QUIET);
while ($text ne "") { # loops while we've still got bad bytes to deal with
### my $badbyte = substr($text, 0, 1); # $badbyte now contains the invalid byte
### my $hex = sprintf("%02X", ord($badbyte)); # two hex digits, zero-padded
### print STDERR "Invalid character \\x" . $hex . " in input - dropping.\n";
$text = substr($text, 1); # skip over the bad byte
$decoded .= decode($encoding, $text, Encode::FB_QUIET);
}
print "Output: $decoded\n";
-----
The code as given silently drops every bad byte and prints no
warnings; if you want warnings, uncomment the lines marked with ###. It
depends what you want your code to do. :D
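If you need this in more than one place, the same loop can be wrapped
in a sub. This is only a sketch: the name decode_dropping_bad_bytes and
the dropped-byte counter are my own additions, not part of Encode, and
the sample input (ASCII with a stray 0xFF byte) is made up for
illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): decode $bytes from $encoding,
# dropping every byte the decoder cannot handle. Returns the decoded
# string and the number of bytes that were dropped.
sub decode_dropping_bad_bytes {
    my ($encoding, $bytes) = @_;   # $bytes is a copy; the caller's data is untouched
    my $dropped = 0;

    # With FB_QUIET, decode() returns what it could decode and leaves
    # the unprocessed remainder in $bytes.
    my $decoded = decode($encoding, $bytes, Encode::FB_QUIET);

    while ($bytes ne "") {
        $bytes = substr($bytes, 1);   # drop the byte that stopped the decoder
        $dropped++;
        $decoded .= decode($encoding, $bytes, Encode::FB_QUIET);
    }
    return ($decoded, $dropped);
}

# Made-up example input: plain ASCII with one invalid byte in the middle.
my ($out, $count) = decode_dropping_bad_bytes("gb2312", "abc\xFFdef");
print "Output: $out ($count byte(s) dropped)\n";
```

Because the sub works on its own copy of the input, the caller doesn't
need to keep a separate $textcopy around.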
Hope this helps!
- Ciaran.