perl-unicode

Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars

2004-10-22 11:30:08
On Oct 23, 2004, at 01:04, Bjoern Hoehrmann wrote:
C12a in Unicode 4.0.1 notes

[...]
  For example, in UTF-8 every code unit of the form 110xxxxx must be
  followed by a code unit of the form 10xxxxxx. A sequence such as
  110xxxxx 0xxxxxxx is ill-formed and must never be generated. When
  faced with this ill-formed code unit sequence while transforming or
  interpreting text, a conformant process must treat the first code unit
  110xxxxx as an illegally terminated code unit sequence--for example,
  by signaling an error, filtering the code unit out, or representing
  the code unit with a marker such as U+FFFD
[...]
[snip]

Okay, you win. You have convinced me that Encode::utf8 should behave the same as Encode::XS (UCM-based encodings). And the patch to make it so is deceptively simple, as follows:

===================================================================
RCS file: Encode.xs,v
retrieving revision 2.0
diff -u -r2.0 Encode.xs
--- Encode.xs   2004/05/16 20:55:15     2.0
+++ Encode.xs   2004/10/22 18:00:29
@@ -297,7 +297,7 @@
            U8 skip = UTF8SKIP(s);
            if ((s + skip) > e) {
                /* Partial character - done */
-               break;
+               goto decode_utf8_fallback;
            }
            else if (is_utf8_char(s)) {
                /* Whole char is good */
@@ -313,6 +313,7 @@
            /* Invalid start byte */
        }
        /* If we get here there is something wrong with alleged UTF-8 */
+    decode_utf8_fallback:
        if (check & ENCODE_DIE_ON_ERR){
            Perl_croak(aTHX_ ERR_DECODE_NOMAP, "utf8", (UV)*s);
            XSRETURN(0);

===================================================================
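To illustrate what the patch changes, here is a minimal standalone sketch in C. It is not code from Encode.xs: `lead_len` and `first_bad_offset` are hypothetical names, and the sequence length is taken from the lead byte's bit pattern only (roughly what UTF8SKIP does for 2 to 4 byte leads), with no checks for overlong forms or surrogates. The point is only that a sequence running past the end of the buffer is reported as an error rather than silently ending the scan.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expected sequence length for a lead byte, judged
 * by bit pattern only. Returns 0 for bytes that cannot start a sequence
 * in this simplified model (continuation bytes and 0xF8..0xFF). */
static size_t lead_len(unsigned char b)
{
    if (b < 0x80)           return 1;   /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}

/* Returns the offset of the first offending lead byte, or -1 if the
 * whole buffer passes this simplified check. */
static long first_bad_offset(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        size_t skip = lead_len(s[i]);
        size_t k;

        if (skip == 0)
            return (long)i;             /* invalid start byte */
        if (skip > len - i)
            return (long)i;             /* partial character: now an error */
        for (k = 1; k < skip; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return (long)i;         /* missing continuation byte */
        }
        i += skip;
    }
    return -1;
}
```

With this sketch, the five bytes of "Bj\xF6rn" are flagged at offset 2, because \xF6 announces a four-byte sequence and only three bytes remain.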

The most decisive comment of yours is this:

holds true and I expect that

  my $x = "Bj\xF6rn"; # as well as "Bj\xF6r" and "Bj\xF6"
  decode("utf-8", $x, Encode::FB_CROAK);

croaks.

Which apparently it did not. Thank you for being so persistent on this problem. I'd be honored to add your name to the AUTHORS file for this.

I will $Encode::VERSION++ as soon as I am done w/ the test suites and Tel's patch. This time I will be careful not to screw up (maint|blead)perl, so give me some time before the update is ready (but I won't keep you waiting for too long since the 5.8.6 deadline is soon).

Your statement about \xF6\x80\x80\x80 is interesting, Encode::is_utf8 is
documented as

[...]
  is_utf8(STRING [, CHECK])
    [INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
    If CHECK is true, also checks the data in STRING for being
    well-formed UTF-8. Returns true if successful, false otherwise.
[...]

And D36 in Unicode 4.0.1 is very clear that

[...]
  As a consequence of the well-formedness conditions specified in Table
  3-6, the following byte values are disallowed in UTF-8: C0–C1, F5–FF.
[...]
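As a small illustration of the D36 consequence quoted above, the following C sketch tests whether a single byte value is one of those that strict UTF-8 never allows. The function name `utf8_byte_disallowed` is hypothetical and is not part of Encode or the perl API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: true for byte values that can never appear in
 * well-formed UTF-8 according to the quoted text (C0-C1 and F5-FF). */
static bool utf8_byte_disallowed(unsigned int b)
{
    assert(b <= 0xFF);                  /* only byte values make sense here */
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}
```

Under this rule the \xF6 byte from the example in this thread is rejected outright, while \xF4 is still an acceptable lead byte.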

That's because perl's notion of Unicode is broader than that of unicode.org. So far Unicode.org's mapping only spans from U+0000 to U+10FFFF, while that of perl goes up to U+ffffFFFF or even U+ffffFFFFffffFFFF (in other words, MAX_UINT). See Camel 3 for details.

And I think we can leave this :)

Dan the Encode Maintainer