perl-unicode

Re: Prototype for decode_utf8 incorrect?

2003-09-26 03:30:04
Autrijus,

Thanks for the report :) -- Murphy's law strikes :(

On Friday, Sep 26, 2003, at 17:23 Asia/Tokyo, Autrijus Tang wrote:
$ perl -MEncode -e'print Encode::decode_utf8(1, 1)'
Too many arguments for Encode::decode_utf8 at -e line 1, at end of line

$ perldoc Encode |grep decode_utf8
       $string = decode_utf8($octets [, CHECK]);

A tricky bug you have found.  Here is what the documentation says:

       $string = decode_utf8($octets [, CHECK]);
         equivalent to "$string = decode("utf8", $octets [, CHECK])".
         The sequence of octets represented by $octets is decoded from
         UTF-8 into a sequence of logical characters. Not all sequences
         of octets form valid UTF-8 encodings, so it is possible for
         this call to fail. For CHECK, see "Handling Malformed Data".

and here is how it is really implemented:

sub decode_utf8($)
{
    my ($str) = @_;
    return undef unless utf8::decode($str);
    return $str;
}

which is RIGHT so long as the prototype of utf8::decode() is '$'

% perl -e 'print utf8::decode()'
Usage: utf8::decode(sv) at -e line 1.
% perl -e 'print utf8::decode(1)'
1
% perl -le 'print utf8::decode(1,1)'
Usage: utf8::decode(sv) at -e line 1.

and utf8::decode is not designed to return status.

% perl -MEncode -e 'print decode_utf8("\xC2\x80")' | hexdump -C
00000000  80                                                |.|
00000001
% perl -MEncode -e 'print decode_utf8("\x80")' | hexdump -C
% perl -MEncode -e 'print decode_utf8("\x7f")' | hexdump -C
00000000  7f                                                |.|
00000001
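
The "Too many arguments" error in the original report comes from the ($) prototype on decode_utf8(), which perl checks at compile time. Here is a minimal sketch of that behaviour. The subs one_arg and two_args are hypothetical stand-ins, not the Encode functions themselves.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for illustration only (not part of Encode).
sub one_arg ($)    { my ($str) = @_; return $str }
sub two_args ($;$) { my ($str, $check) = @_; return $check ? "checked" : "unchecked" }

# prototype() reports the prototype string attached to a sub.
print prototype(\&one_arg),  "\n";
print prototype(\&two_args), "\n";

# A call that violates the prototype is rejected when the code is compiled,
# so it is wrapped in a string eval to keep this script itself compilable.
my $ok = eval q{ one_arg(1, 1); 1 };
print "one_arg(1, 1): ", ($ok ? "compiled\n" : "rejected: $@");

# With ($;$) the second argument is optional, so both forms compile.
print "two_args(1):    ", two_args(1),    "\n";
print "two_args(1, 1): ", two_args(1, 1), "\n";
```

The point of the sketch is that accepting CHECK has to start with the prototype, since the body of the sub is never reached otherwise.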

I consider this more a feature bug than a documentation bug. But I wonder how I should fix it. Fixing utf8::decode() involves tweaking the core, so it would be nice if it could be fixed on the Encode side. Fortunately, Encode::decode("utf8" => $str) works.

% perl -MEncode -e '$a="\xC2\x80"; print decode("utf8"=>$a, 1)' | hexdump -C
00000000  80                                                |.|
00000001
% perl -MEncode -e '$a="\x80"; print decode("utf8"=>$a, 1)' | hexdump -C
utf8 "\x80" does not map to Unicode at /usr/local/lib/perl5/5.8.0/i386-freebsd/Encode.pm line 164.
% perl -MEncode -e '$a="\x7f"; print decode("utf8"=>$a, 1)' | hexdump -C
00000000  7f                                                |.|
00000001

So we can rewrite decode_utf8() as follows:

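sub decode_utf8($;$)
{
    my ($str, $check) = @_;
    if ($check){
        # pass the arguments explicitly: decode() has a ($$;$) prototype,
        # so a bare @_ would be evaluated in scalar context
        return decode("utf8", $str, $check);
    }else{
        return undef unless utf8::decode($str);
        return $str;
    }
}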
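
To see how a two-argument version would behave from the caller's side, here is a rough sketch. It assumes a local copy of the sub under the hypothetical name my_decode_utf8, so that it does not clash with the decode_utf8 that Encode exports, and it hands $str and $check to decode() explicitly. The octets are kept in variables because decode() may modify its source argument when CHECK is set.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical local copy of the proposed sub, renamed for this sketch.
sub my_decode_utf8 ($;$) {
    my ($str, $check) = @_;
    if ($check) {
        # Strict path: let Encode::decode() apply CHECK.
        return decode("utf8", $str, $check);
    }
    # Lax path: in-place decode of the copy, undef on failure.
    return undef unless utf8::decode($str);
    return $str;
}

my $good = "\xC2\x80";   # well-formed UTF-8 for U+0080
my $bad  = "\x80";       # lone continuation byte, malformed

# Well-formed input with CHECK set decodes to a single character.
my $chars = my_decode_utf8($good, 1);
printf "good, CHECK=1: U+%04X\n", ord $chars;

# Malformed input without CHECK: the caller has to test for undef.
my $lax = my_decode_utf8($bad);
print "bad, no CHECK: ", (defined $lax ? "defined" : "undef"), "\n";

# Malformed input with CHECK=1: decode() croaks, so trap it with eval.
my $strict = eval { my_decode_utf8($bad, 1) };
print "bad, CHECK=1: $@" unless defined $strict;
```

Callers that pass CHECK need an eval (or a non-croaking CHECK value) around the call, while callers that omit it keep the current undef-on-failure behaviour.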

Dan the Encode Maintainer
