Autrijus,
Thanks for the report :) -- Murphy's law strikes :(
On Friday, Sep 26, 2003, at 17:23 Asia/Tokyo, Autrijus Tang wrote:
> $ perl -MEncode -e'print Encode::decode_utf8(1, 1)'
> Too many arguments for Encode::decode_utf8 at -e line 1, at end of line
> $ perldoc Encode |grep decode_utf8
> $string = decode_utf8($octets [, CHECK]);
A tricky bug you have found. Here is what the documentation says:
    $string = decode_utf8($octets [, CHECK]);
        equivalent to "$string = decode("utf8", $octets [, CHECK])".
        The sequence of octets represented by $octets is decoded from
        UTF-8 into a sequence of logical characters. Not all sequences
        of octets form valid UTF-8 encodings, so it is possible for
        this call to fail. For CHECK, see "Handling Malformed Data".
and here is how it is really implemented:
sub decode_utf8($)
{
    my ($str) = @_;
    return undef unless utf8::decode($str);
    return $str;
}
which is RIGHT so long as the prototype of utf8::decode() is '$'
% perl -e 'print utf8::decode()'
Usage: utf8::decode(sv) at -e line 1.
% perl -e 'print utf8::decode(1)'
1
% perl -le 'print utf8::decode(1,1)'
Usage: utf8::decode(sv) at -e line 1.
and utf8::decode is not designed to return status.
% perl -MEncode -e 'print decode_utf8("\xC2\x80")' | hexdump -C
00000000 80 |.|
00000001
% perl -MEncode -e 'print decode_utf8("\x80")' | hexdump -C
% perl -MEncode -e 'print decode_utf8("\x7f")' | hexdump -C
00000000 7f |.|
00000001
I consider this a feature bug rather than a documentation bug, but I
wonder how I should fix it. Fixing utf8::decode() involves tweaking the
core, so it would be nice if it could be fixed on the Encode side.
Fortunately, Encode::decode("utf8" => $str) works.
% perl -MEncode -e '$a="\xC2\x80"; print decode("utf8"=>$a, 1)' | hexdump -C
00000000 80 |.|
00000001
% perl -MEncode -e '$a="\x80"; print decode("utf8"=>$a, 1)' | hexdump -C
utf8 "\x80" does not map to Unicode at /usr/local/lib/perl5/5.8.0/i386-freebsd/Encode.pm line 164.
% perl -MEncode -e '$a="\x7f"; print decode("utf8"=>$a, 1)' | hexdump -C
00000000 7f |.|
00000001
So we can rewrite decode_utf8() as follows:
sub decode_utf8($;$)
{
    my ($str, $check) = @_;
    if ($check) {
        return decode("utf8", @_);
    } else {
        return undef unless utf8::decode($str);
        return $str;
    }
}
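
For anyone who wants to try the idea before a new Encode is released, here
is a minimal standalone sketch. The name my_decode_utf8 is hypothetical (it
only avoids clashing with Encode's own export), and the behavior shown
assumes that decode() croaks on malformed input when CHECK is 1, as in the
transcript above.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical standalone copy of the proposed function; not part of Encode.
sub my_decode_utf8 {
    my ($str, $check) = @_;
    if ($check) {
        # Delegate to decode() so that CHECK is honored.
        return Encode::decode("utf8", $str, $check);
    }
    # Without CHECK, keep the old behavior: undef on malformed input.
    return undef unless utf8::decode($str);
    return $str;
}

my $ok  = my_decode_utf8("\xC2\x80");   # well-formed UTF-8 for U+0080
my $bad = my_decode_utf8("\x80");       # malformed, no CHECK given
my $err = eval { my_decode_utf8("\x80", 1); 1 } ? '' : $@;  # malformed, CHECK = 1

printf "ok:  U+%04X\n", ord $ok;
print  "bad: ", (defined $bad ? "defined" : "undef"), "\n";
print  "err: ", ($err ? "croaked" : "no error"), "\n";
```

With CHECK omitted the caller still has to test for undef; with a true
CHECK the failure arrives as an exception, so wrap the call in eval if you
want to recover from it.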
Dan the Encode Maintainer