The Encode manpage says this about FB_QUIET:
| CHECK = Encode::FB_QUIET
|
| If CHECK is set to Encode::FB_QUIET, (en|de)code will
| immediately return the portion of the data that has been
| processed so far when an error occurs. The data argument will
| be overwritten with everything after that point (that is, the
| unprocessed part of data). This is handy when you have to
| call decode repeatedly in the case where your source data may
| contain partial multi-byte character sequences, for example
| because you are reading with a fixed-width buffer. Here is
| some sample code that does exactly this:
|
| my $data = ''; my $utf8 = '';
| while(defined(read $fh, $buffer, 256)){
| # buffer may end in a partial character so we append
| $data .= $buffer;
| $utf8 .= decode($encoding, $data, Encode::FB_QUIET);
| # $data now contains the unprocessed partial character
| }
First, this sample code is broken: the loop will normally never
terminate, because read() returns undef only on error. At EOF it
returns 0, which is still a defined value.
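To make the read() return values concrete, here is a minimal sketch
that drains an in-memory filehandle and stops on a 0 return. The
count_reads() helper and its $chunk parameter are made up for this
illustration and are not part of any module:

```perl
use strict;
use warnings;

# Hypothetical helper: count how many read() calls it takes to drain
# an in-memory handle holding $content, reading $chunk bytes at a time.
sub count_reads {
    my ($content, $chunk) = @_;
    open(my $fh, '<', \$content) or die "Can't open in-memory handle: $!";
    my $calls = 0;
    while (1) {
        my $n = read($fh, my $buffer, $chunk);
        die "Can't read: $!" unless defined $n;   # undef: a real I/O error
        $calls++;
        last if $n == 0;                          # 0: end of file, stop here
    }
    return $calls;
}

print count_reads("x" x 600, 256), "\n";
```

The last call is the one that returns 0, so a loop that only tests
defined() would keep going after it.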
Second, we will end up accumulating the rest of the file in $data as
soon as we encounter a bad byte in the stream, because decode() stops
at that byte on every call. We need to distinguish between bad stuff
and incomplete sequences. Also note that an incomplete sequence at EOF
is bad stuff.
I believe this function will do the right thing:
use Encode;

sub read_utf8 {
    my($fh, $bad_byte_cb) = @_;
    my $str = "";   # where we accumulate the result
    my $buf = "";   # undecoded bytes carried over between reads
    my $n;
    do {
        # append the next chunk after whatever is left in $buf
        $n = read($fh, $buf, 16, length($buf));
        die "Can't read: $!" unless defined $n;
        while (length $buf) {
            # decode() removes what it could process from $buf
            $str .= Encode::decode("UTF-8", $buf, Encode::FB_QUIET);
            last if $n && length($buf) < 4;   # possibly an incomplete char
            if (length($buf)) {
                # $buf now starts with a bad byte; drop it and report it
                my $bad_byte = substr($buf, 0, 1, "");
                $str .= &$bad_byte_cb(ord($bad_byte)) if $bad_byte_cb;
            }
        }
    } while $n;   # read() returns 0 at EOF
    return $str;
}
# test it
use Data::Dump;
print Data::Dump::dump(read_utf8(*STDIN, sub { sprintf "%%%02X", shift })),
      "\n";
So I suggest adding this as an example to the documentation. What I
don't like here is the test for an incomplete char. What I really want
is for Encode::decode() to tell me what the situation is, so I want to
extend its API. The simplest way seems to be to add another argument
that is updated to reflect this status:
Encode::decode("UTF-8", $buf, Encode::FB_QUIET, $incomplete);
where $incomplete will be TRUE iff there is stuff left in $buf and the
reason is that more data is needed to decode properly. Is this an
acceptable extension?
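
Until something like that exists, the status can be approximated in
pure Perl by looking at the bytes that decode() left behind. The
following is only a sketch: utf8_incomplete() is a hypothetical helper
name, not part of Encode, and it assumes the leftover is a byte string
and that "incomplete" means "a proper prefix of a well-formed UTF-8
sequence" (no overlongs, no surrogates, nothing above U+10FFFF).

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper, not part of Encode: true if $buf consists only
# of the beginning of a well-formed UTF-8 sequence, i.e. more input
# could still turn it into a valid character.
sub utf8_incomplete {
    my ($buf) = @_;
    return $buf =~ /\A(?:
          [\xC2-\xDF]                                  # 2-byte lead
        | \xE0 [\xA0-\xBF]?                            # 3-byte, no overlongs
        | [\xE1-\xEC\xEE\xEF] [\x80-\xBF]?             # 3-byte, general
        | \xED [\x80-\x9F]?                            # 3-byte, no surrogates
        | \xF0 (?: [\x90-\xBF] [\x80-\xBF]? )?         # 4-byte, no overlongs
        | [\xF1-\xF3] (?: [\x80-\xBF] [\x80-\xBF]? )?  # 4-byte, general
        | \xF4 (?: [\x80-\x8F] [\x80-\xBF]? )?         # 4-byte, up to U+10FFFF
    )\z/x ? 1 : 0;
}

# Usage: "\xE2\x82" is the start of the euro sign (E2 82 AC).
my $buf = "abc\xE2\x82";
my $str = Encode::decode("UTF-8", $buf, Encode::FB_QUIET());
# Whatever decode() could not process is left in $buf.
print utf8_incomplete($buf) ? "need more data\n" : "bad byte\n";
```

With this, the length test in read_utf8() could become
"last if $n && utf8_incomplete($buf);", but it duplicates knowledge
that the decoder already has, which is why I would rather have
decode() report it.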
Regards,
Gisle