perl-unicode

Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars

2004-10-22 09:30:10
* Dan Kogai wrote:
>   perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"
>   perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"
>
> Though unicode.org does not assign any character on U+180000 (yet),
> "\xF6\x80\x80\x80" is a valid UTF-8 character from perl's point of
> view.  Perl only finds it corrupted when it reaches the following 'r'.
>
> In such cases, WHAT PART OF THE SEQUENCE IS CORRUPTED? \xF6 ? or the
> following 'r' ? or 3 more octets? (FYI that's what \xF6 suggests from
> UTF-8's point of view).

C12a in Unicode 4.0.1 notes

[...]
  For example, in UTF-8 every code unit of the form 110xxxxx must be
  followed by a code unit of the form 10xxxxxx. A sequence such as
  110xxxxx 0xxxxxxx is ill-formed and must never be generated. When
  faced with this ill-formed code unit sequence while transforming or
  interpreting text, a conformant process must treat the first code unit
  110xxxxx as an illegally terminated code unit sequence--for example,
  by signaling an error, filtering the code unit out, or representing
  the code unit with a marker such as U+FFFD
[...]

IOW, the corrupted part is the \xF6. According to `perldoc Encode`

[...]
  *CHECK* = Encode::FB_DEFAULT ( == 0)
    If *CHECK* is 0, (en|de)code will put a *substitution character* in
    place of a malformed character. For UCM-based encodings, <subchar>
    will be used. For Unicode, the code point 0xFFFD is used. If the
    data is supposed to be UTF-8, an optional lexical warning (category
    utf8) is given.
[...]

the module chooses the replacement character approach and I thus expect
that none of

  decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6rn")
  decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6r")
  decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6")

holds true and I expect that

  my $x = "Bj\xF6rn"; # as well as "Bj\xF6r" and "Bj\xF6"
  decode("utf-8", $x, Encode::FB_CROAK);

croaks. The partial decoding approach is useful, but only if CHECK is
set to a value that makes the remaining octets available to the script,
not for CHECK == 0. Why would anyone want it to behave differently?
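
These expectations are easy to check with a small script. The following
is only a sketch: it uses nothing beyond the documented decode() and
FB_CROAK interface, the helper name probe() is made up for this mail,
and what it prints depends on the installed version of Encode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode $octets once with CHECK == 0 and once
# with FB_CROAK, and report what happened in each case.
sub probe {
    my ($octets) = @_;

    # CHECK == 0: per the documentation, a malformed character should
    # be replaced by U+FFFD rather than silently dropped.
    my $lax = decode("utf-8", $octets);

    # FB_CROAK: malformed input should die. decode() may modify its
    # argument when CHECK is set, so hand it a copy.
    my $copy    = $octets;
    my $croaked = eval { decode("utf-8", $copy, Encode::FB_CROAK()); 1 } ? 0 : 1;

    return {
        same_as_Bj => $lax eq "Bj"       ? 1 : 0,
        has_fffd   => $lax =~ /\x{FFFD}/ ? 1 : 0,
        croaked    => $croaked,
    };
}

for my $octets ("Bj\xF6rn", "Bj\xF6r", "Bj\xF6") {
    my $r = probe($octets);
    print uc(unpack("H*", $octets)),
        ": same_as_Bj=$r->{same_as_Bj}",
        " has_fffd=$r->{has_fffd}",
        " croaked=$r->{croaked}\n";
}
```

What I expect for all three inputs is same_as_Bj=0, has_fffd=1 and
croaked=1.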

Your statement about \xF6\x80\x80\x80 is interesting. Encode::is_utf8 is
documented as

[...]
  is_utf8(STRING [, CHECK])
    [INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
    If CHECK is true, also checks the data in STRING for being
    well-formed UTF-8. Returns true if successful, false otherwise.
[...]

And D36 in Unicode 4.0.1 is very clear that

[...]
  As a consequence of the well-formedness conditions specified in Table
  3-6, the following byte values are disallowed in UTF-8: C0–C1, F5–FF.
[...]

I would thus never expect that

  Encode::is_utf8(decode(utf8 => qq(\xF6\x80\x80\x80)), 1)

returns true or that

  my $x = qq(\xF6\x80\x80\x80);
  decode(utf8 => $x, Encode::FB_CROAK);

does not croak. The byte string here is *not* well-formed UTF-8! I do
not really understand why one would expect something different.
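
For reference, the well-formedness conditions of Table 3-6 fit into a
single regular expression. This is a sketch written for this mail, not
something taken from Encode; the name is_well_formed_utf8() is made up,
and the input is assumed to be a plain byte string.

```perl
use strict;
use warnings;

# One well-formed UTF-8 byte sequence, following the ranges in
# Table 3-6 of Unicode 4.0.1. C0-C1 and F5-FF never appear.
my $WELL_FORMED = qr/
      [\x00-\x7F]
    | [\xC2-\xDF] [\x80-\xBF]
    | \xE0        [\xA0-\xBF] [\x80-\xBF]
    | [\xE1-\xEC] [\x80-\xBF]{2}
    | \xED        [\x80-\x9F] [\x80-\xBF]
    | [\xEE-\xEF] [\x80-\xBF]{2}
    | \xF0        [\x90-\xBF] [\x80-\xBF]{2}
    | [\xF1-\xF3] [\x80-\xBF]{3}
    | \xF4        [\x80-\x8F] [\x80-\xBF]{2}
/x;

# Hypothetical helper: true if the whole byte string consists of
# well-formed sequences only.
sub is_well_formed_utf8 {
    my ($octets) = @_;
    return $octets =~ /\A (?: $WELL_FORMED )* \z/x ? 1 : 0;
}

for my $octets ("Bj\xC3\xB6rn", "Bj\xF6rn", "\xF6\x80\x80\x80") {
    printf "%-12s %s\n", uc(unpack("H*", $octets)),
        is_well_formed_utf8($octets) ? "well-formed" : "ill-formed";
}
```

Only the first string passes; 0xF6 cannot start any of the alternatives,
so both other strings are rejected no matter what follows the octet.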

If this is really intentional and kept unchanged, there should at least
be highly visible warnings in the documentation on when malformed input
is ignored silently (and/or where "UTF-8" does not mean UTF-8 as defined
in Unicode or RFC 3629). Clearly, if "well-formed UTF-8" means something
different in Perl and outside Perl people necessarily get confused...

[perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"]
[perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"]

> IMHO I believe the current implementation is correct since you can't
> really tell if the sequence is corrupted just by looking at a given octet.

Well, there is no need to look at just a single octet here, nothing
stops the routine from checking the octets following 0xF6, so I would
say there needs to be a better reason to consider this behavior correct.
I do not think the implementation matches the documentation or what one
would expect from the Unicode standard.
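
To illustrate, here is a minimal sketch of a decoder that looks at the
octets following the lead octet before deciding anything. It is not how
Encode is implemented; sequence_length() and decode_replacing() are
made-up names, the ranges are those of Table 3-6, and emitting one
U+FFFD per offending octet is just one of the choices C12a allows.

```perl
use strict;
use warnings;

# Hypothetical helper: length of the well-formed sequence starting at
# $pos, or 0 if no well-formed sequence starts there.
sub sequence_length {
    my ($octets, $pos) = @_;
    my @b    = unpack "C*", substr($octets, $pos, 4);
    my $lead = $b[0];
    return 1 if $lead <= 0x7F;

    # Sequence length and the allowed range for the second octet.
    my ($len, $lo, $hi);
    if    ($lead >= 0xC2 && $lead <= 0xDF) { ($len, $lo, $hi) = (2, 0x80, 0xBF) }
    elsif ($lead == 0xE0)                  { ($len, $lo, $hi) = (3, 0xA0, 0xBF) }
    elsif ($lead == 0xED)                  { ($len, $lo, $hi) = (3, 0x80, 0x9F) }
    elsif ($lead >= 0xE1 && $lead <= 0xEF) { ($len, $lo, $hi) = (3, 0x80, 0xBF) }
    elsif ($lead == 0xF0)                  { ($len, $lo, $hi) = (4, 0x90, 0xBF) }
    elsif ($lead >= 0xF1 && $lead <= 0xF3) { ($len, $lo, $hi) = (4, 0x80, 0xBF) }
    elsif ($lead == 0xF4)                  { ($len, $lo, $hi) = (4, 0x80, 0x8F) }
    else                                   { return 0 }    # 80-C1, F5-FF

    return 0 if @b < $len;                              # truncated
    return 0 if $b[1] < $lo || $b[1] > $hi;             # bad second octet
    for my $i (2 .. $len - 1) {
        return 0 if $b[$i] < 0x80 || $b[$i] > 0xBF;     # bad trailing octet
    }
    return $len;
}

# Decode a byte string, putting U+FFFD in place of every octet that
# does not begin a well-formed sequence.
sub decode_replacing {
    my ($octets) = @_;
    my ($out, $pos) = ('', 0);
    while ($pos < length $octets) {
        my $len = sequence_length($octets, $pos);
        if ($len) {
            my $seq = substr($octets, $pos, $len);
            utf8::decode($seq);      # sequence already known to be well-formed
            $out .= $seq;
            $pos += $len;
        }
        else {
            $out .= "\x{FFFD}";
            $pos += 1;
        }
    }
    return $out;
}

for my $octets ("Bj\xF6rn", "Bj\xF6") {
    print join(" ", map { sprintf "U+%04X", ord($_) }
                    split //, decode_replacing($octets)), "\n";
}
```

With this, the 'r' and 'n' after 0xF6 survive and only the 0xF6 itself
is replaced, which is what I read C12a as asking for.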