* Dan Kogai wrote:
perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"
perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"
Though unicode.org does not assign any character on U+180000 (yet),
"\xF6\x80\x80\x80" is a valid UTF-8 character from perl's point of
view. Perl only finds it corrupted when it reaches the following 'r'.
In such cases, WHAT PART OF THE SEQUENCE IS CORRUPTED? \xF6 ? or the
following 'r' ? or 3 more octets? (FYI that's what \xF6 suggests from
UTF-8's point of view).
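
For reference, the U+180000 figure comes from reading \xF6 as the lead byte of a four-octet sequence and concatenating the payload bits. The following is only a sketch of that arithmetic; the helper name naive_code_point is made up for illustration and deliberately performs no well-formedness checks.

```perl
use strict;
use warnings;

# Hypothetical helper: combine a 11110xxx-style lead byte with its
# continuation bytes by plain bit arithmetic, with no range checks.
sub naive_code_point {
    my ($lead, @cont) = @_;
    my $cp = $lead & 0x07;                     # 3 payload bits of the lead
    $cp = ($cp << 6) | ($_ & 0x3F) for @cont;  # 6 payload bits per octet
    return $cp;
}

my $value = naive_code_point(0xF6, 0x80, 0x80, 0x80);
printf "U+%X (beyond U+10FFFF: %s)\n",
    $value, $value > 0x10FFFF ? 'yes' : 'no';
# prints "U+180000 (beyond U+10FFFF: yes)"
```

The result lies above U+10FFFF, the upper limit of the Unicode code space.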
C12a in Unicode 4.0.1 notes
[...]
For example, in UTF-8 every code unit of the form 110xxxxx must be
followed by a code unit of the form 10xxxxxx. A sequence such as
110xxxxx 0xxxxxxx is ill-formed and must never be generated. When
faced with this ill-formed code unit sequence while transforming or
interpreting text, a conformant process must treat the first code unit
110xxxxx as an illegally terminated code unit sequence--for example,
by signaling an error, filtering the code unit out, or representing
the code unit with a marker such as U+FFFD
[...]
IOW, the \xF6. According to `perldoc Encode`
[...]
*CHECK* = Encode::FB_DEFAULT ( == 0)
If *CHECK* is 0, (en|de)code will put a *substitution character* in
place of a malformed character. For UCM-based encodings, <subchar>
will be used. For Unicode, the code point 0xFFFD is used. If the
data is supposed to be UTF-8, an optional lexical warning (category
utf8) is given.
[...]
the module chooses the replacement character approach and I thus expect
that none of
decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6rn")
decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6r")
decode("utf-8", "Bj") eq decode("utf-8", "Bj\xF6")
holds true and I expect that
my $x = "Bj\xF6rn"; # as well as "Bj\xF6r" and "Bj\xF6"
decode("utf-8", $x, Encode::FB_CROAK);
croaks. Partial decoding is useful, but only when check is set to a
value that makes the remaining octets available to the script, not when
check == 0. Why would anyone want it to behave differently?
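
A small probe along these lines can show what a given Encode installation actually does with the three inputs. This is a sketch, not a statement of what any particular version returns: the helper name probe is invented here, and which encoding object the name 'UTF-8' resolves to (and how strict it is) depends on the installed Encode.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK FB_QUIET);

# Hypothetical helper: run decode() on a copy of the octets and report
# whether it croaked, what it returned, and which octets were left over.
sub probe {
    my ($octets, $check) = @_;
    my $copy = $octets;    # decode() may modify its argument when CHECK is set
    my $text = eval { decode('UTF-8', $copy, $check) };
    return { croaked => 1, error => $@ } unless defined $text;
    return { croaked => 0, decoded => $text, rest => $copy };
}

for my $octets ("Bj\xF6rn", "Bj\xF6r", "Bj\xF6") {
    my $croak = probe($octets, FB_CROAK);
    my $quiet = probe($octets, FB_QUIET);
    printf "%d input octets: FB_CROAK croaked=%d, FB_QUIET left %d octet(s)\n",
        length $octets, $croak->{croaked}, length($quiet->{rest} // '');
}
```

With FB_QUIET the leftover octets stay in the caller's variable, which is the case where partial decoding is actually usable by the script.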
Your statement about \xF6\x80\x80\x80 is interesting. Encode::is_utf8 is
documented as
[...]
is_utf8(STRING [, CHECK])
[INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
If CHECK is true, also checks the data in STRING for being
well-formed UTF-8. Returns true if successful, false otherwise.
[...]
And D36 in Unicode 4.0.1 is very clear that
[...]
As a consequence of the well-formedness conditions specified in Table
3-6, the following byte values are disallowed in UTF-8: C0–C1, F5–FF.
[...]
I would thus never expect that
Encode::is_utf8(decode(utf8 => qq(\xF6\x80\x80\x80)), 1)
returns true or that
my $x = qq(\xF6\x80\x80\x80);
decode(utf8 => $x, Encode::FB_CROAK);
does not croak. The byte string here is *not* well-formed UTF-8! I do
not really understand why one would expect something different.
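
To make the well-formedness conditions concrete, the byte ranges of the table cited above can be written as a pattern over octets. This is a sketch independent of Encode; the names $WELL_FORMED and is_well_formed_utf8 are made up for illustration, and the input is assumed to be a byte string.

```perl
use strict;
use warnings;

# One well-formed UTF-8 byte sequence, following the ranges of the
# Unicode well-formedness table (C0-C1 and F5-FF never appear).
my $WELL_FORMED = qr{
      [\x00-\x7F]                          # U+0000..U+007F
    | [\xC2-\xDF] [\x80-\xBF]              # U+0080..U+07FF
    | \xE0        [\xA0-\xBF] [\x80-\xBF]  # U+0800..U+0FFF
    | [\xE1-\xEC] [\x80-\xBF]{2}           # U+1000..U+CFFF
    | \xED        [\x80-\x9F] [\x80-\xBF]  # U+D000..U+D7FF
    | [\xEE-\xEF] [\x80-\xBF]{2}           # U+E000..U+FFFF
    | \xF0        [\x90-\xBF] [\x80-\xBF]{2}   # U+10000..U+3FFFF
    | [\xF1-\xF3] [\x80-\xBF]{3}           # U+40000..U+FFFFF
    | \xF4        [\x80-\x8F] [\x80-\xBF]{2}   # U+100000..U+10FFFF
}x;

# Hypothetical helper: true if the whole byte string is well-formed.
sub is_well_formed_utf8 {
    my ($octets) = @_;
    return $octets =~ /\A (?: $WELL_FORMED )* \z/x ? 1 : 0;
}

print is_well_formed_utf8("Bj\xC3\xB6rn")     ? "ok\n" : "ill-formed\n";
print is_well_formed_utf8("\xF6\x80\x80\x80") ? "ok\n" : "ill-formed\n";
# prints "ok" then "ill-formed"
```

Under this pattern \xF6 cannot start any sequence, regardless of the octets that follow it.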
If this is really intentional and kept unchanged, there should at least
be highly visible warnings in the documentation on when malformed input
is ignored silently (and/or where "UTF-8" does not mean UTF-8 as defined
in Unicode or RFC 3629). Clearly, if "well-formed UTF-8" means something
different in Perl and outside Perl, people necessarily get confused...
[perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rn))"]
[perl -MEncode -e "print decode(q(utf-8), qq(Bj\xF6rnx))"]
IMHO I believe the current implementation is correct since you can't
really tell if the sequence is corrupted just by looking at a given octet.
Well, there is no need to look at just a single octet here, nothing
stops the routine from checking the octets following 0xF6, so I would
say there needs to be a better reason to consider this behavior correct.
I do not think the implementation matches the documentation or what one
would expect from the Unicode standard.
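
As a rough sketch of what checking the following octets could look like, here is a pure-Perl decoder that replaces only the offending lead byte with U+FFFD and resumes at the next octet, in the spirit of C12a. It is not how Encode is implemented; decode_with_replacement is a made-up name, and the sketch is simplified in that it does not reject overlong three- and four-octet forms, surrogates, or values above U+10FFFF.

```perl
use strict;
use warnings;

# Hypothetical decoder: substitutes U+FFFD for an invalid or illegally
# terminated lead byte and continues with the octet right after it.
sub decode_with_replacement {
    my ($octets) = @_;
    my @o   = unpack 'C*', $octets;
    my $out = '';
    my $i   = 0;
    while ($i < @o) {
        my $lead = $o[$i];
        # Number of continuation octets needed and the lead's payload bits.
        my ($need, $cp) =
              $lead <= 0x7F                  ? (0, $lead)
            : $lead >= 0xC2 && $lead <= 0xDF ? (1, $lead & 0x1F)
            : $lead >= 0xE0 && $lead <= 0xEF ? (2, $lead & 0x0F)
            : $lead >= 0xF0 && $lead <= 0xF4 ? (3, $lead & 0x07)
            :                                  (-1, 0);   # C0-C1, F5-FF, stray 10xxxxxx
        my $ok = $need >= 0 && $i + $need <= $#o;
        if ($ok) {
            for my $k (1 .. $need) {
                my $c = $o[$i + $k];
                if (($c & 0xC0) != 0x80) { $ok = 0; last }   # not 10xxxxxx
                $cp = ($cp << 6) | ($c & 0x3F);
            }
        }
        if ($ok) { $out .= chr $cp;     $i += $need + 1 }
        else     { $out .= "\x{FFFD}";  $i++ }
    }
    return $out;
}

printf "%vX\n", decode_with_replacement("Bj\xF6rn");
# prints "42.6A.FFFD.72.6E"
```

Here the 'r' and 'n' after \xF6 survive, so the result differs from what decoding "Bj" alone gives.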