Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars

On Oct 22, 2004, at 20:42, Bjoern Hoehrmann wrote:

No, you misread the bug report, I expect that

  perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"
  perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"

behave the same in that the malformed sequence \xF6 gets replaced by
U+FFFD as documented in `perldoc Encode` for check = Encode::FB_DEFAULT.
Encode::utf8::decode_xs() fails to do that for the reason outlined in my
bug report so the current result is


"\xF6" ALONE does not mean that the sequence is malformed.  Try

  perl -Mencoding=utf8 -le 'print "\x{180000}"' | hexdump -C

Though unicode.org does not assign any character on U+180000 (yet),
"\xF6\x80\x80\x80" is a valid UTF-8 character from perl's point of
view. Perl only finds it corrupted when it reaches the following 'r'.

In such cases, WHAT PART OF THE SEQUENCE IS CORRUPTED? \xF6? Or the
following 'r'? Or 3 more octets? (FYI that's what \xF6 suggests from
UTF-8's point of view.)
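
To make the bit arithmetic above concrete, here is a minimal sketch in Python (not Perl, and NOT Encode's actual code). It assumes the original UTF-8 bit layout, in which a 4-octet sequence is 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, which is the lax form perl accepts; RFC 3629 would reject \xF6 outright. The names expected_length, encode_4byte and classify are hypothetical helpers made up for this illustration.

```python
def expected_length(lead: int) -> int:
    """Sequence length implied by a lead octet (0 = not a usable lead here)."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if lead < 0xC0:
        return 0  # 10xxxxxx: continuation octet, cannot start a sequence
    if lead < 0xE0:
        return 2  # 110xxxxx
    if lead < 0xF0:
        return 3  # 1110xxxx
    if lead < 0xF8:
        return 4  # 11110xxx, includes 0xF6
    return 0      # longer forms are left out of this sketch


def encode_4byte(cp: int) -> bytes:
    """Pack a code point into the 4-octet layout (no range check against U+10FFFF)."""
    assert 0x10000 <= cp <= 0x1FFFFF
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def classify(data: bytes, i: int) -> str:
    """Hypothetical check of the sequence starting at data[i].

    Assumption: length is tested before the continuation octets are inspected.
    """
    n = expected_length(data[i])
    if n == 0:
        return "malformed"
    need = n - 1
    tail = data[i + 1 : i + 1 + need]
    if len(tail) < need:
        return "truncated"   # input ends before the sequence could be complete
    if any(not 0x80 <= b <= 0xBF for b in tail):
        return "malformed"   # a following octet is not 10xxxxxx
    return "ok"


print(encode_4byte(0x180000).hex())   # f6808080
print(classify(b"Bj\xf6rn", 2))       # truncated
print(classify(b"Bj\xf6rnx", 2))      # malformed
```

Under these assumptions \xF6 announces three more octets. "Bj\xF6rn" has only two octets left, so a length-first check sees an incomplete character, while "Bj\xF6rnx" has three, so the 'r' is what gets rejected. That is one plausible reading of why the two inputs are treated differently; it does not settle which octets ought to be replaced by U+FFFD.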

  Bj
  Bj\x{FFFD}rnx

it should be

  Bj\x{FFFD}rn
  Bj\x{FFFD}rnx


So you can't really say which behavior is "correct".

I fail to see what this has to do with how Perl treats the string as
from a Perl perspective there is no real difference here, Perl works
as expected, decode() does not.

(I've posted this to RT but it again does not show up there, see
http://lists.w3.org/Archives/Public/www-archive/2004Oct/0044.html).

IMHO the current implementation is correct since you can't really
tell if the sequence is corrupted just by looking at a given octet.
At the same time I believe this should be documented somehow
somewhere.


Dan the Encode Maintainer