Re: [cpan #8089] Encode::utf8::decode_xs does not check partial chars

On Oct 23, 2004, at 01:04, Bjoern Hoehrmann wrote:

C12a in Unicode 4.0.1 notes

[...]
  For example, in UTF-8 every code unit of the form 110xxxxx must be
  followed by a code unit of the form 10xxxxxx. A sequence such as
  110xxxxx 0xxxxxxx is ill-formed and must never be generated. When
  faced with this ill-formed code unit sequence while transforming or
  interpreting text, a conformant process must treat the first code unit
  110xxxxx as an illegally terminated code unit sequence--for example,
  by signaling an error, filtering the code unit out, or representing
  the code unit with a marker such as U+FFFD
[...]
[snip]

Okay, you win. You have convinced me that Encode::utf8 should behave the same as Encode::XS (UCM-based encodings). And the patch to make it behave that way is deceptively simple, as follows:


===================================================================
RCS file: Encode.xs,v
retrieving revision 2.0
diff -u -r2.0 Encode.xs
--- Encode.xs   2004/05/16 20:55:15     2.0
+++ Encode.xs   2004/10/22 18:00:29
@@ -297,7 +297,7 @@
            U8 skip = UTF8SKIP(s);
            if ((s + skip) > e) {
                /* Partial character - done */
-               break;
+               goto decode_utf8_fallback;
            }
            else if (is_utf8_char(s)) {
                /* Whole char is good */
@@ -313,6 +313,7 @@
            /* Invalid start byte */
        }
        /* If we get here there is something wrong with alleged UTF-8 */
+    decode_utf8_fallback:
        if (check & ENCODE_DIE_ON_ERR){
            Perl_croak(aTHX_ ERR_DECODE_NOMAP, "utf8", (UV)*s);
            XSRETURN(0);

===================================================================

The most decisive comment of yours is this:

holds true and I expect that

  my $x = "Bj\xF6rn"; # as well as "Bj\xF6r" and "Bj\xF6"
  decode("utf-8", $x, Encode::FB_CROAK);

croaks.

Which apparently it did not. Thank you for being so persistent on this problem. I'd be honored to add your name to the AUTHORS file for this.
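To see why all three of those strings land in the "Partial character" branch the patch touches, here is a minimal standalone sketch in C. It is not the code in Encode.xs: sketch_utf8_skip() is a hypothetical stand-in for perl's UTF8SKIP, simplified and assuming perl's extended lead-byte lengths, and sketch_find_partial() checks truncation only, not continuation bytes.

```c
#include <stddef.h>

/* Hypothetical stand-in for perl's UTF8SKIP: expected sequence length
 * for a lead byte, assuming perl's extended (beyond U+10FFFF) scheme.
 * Simplified: every byte >= 0xFE is treated as a 7-byte lead. */
static size_t sketch_utf8_skip(unsigned char c)
{
    if (c < 0xC0) return 1;   /* ASCII, or a stray continuation byte */
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    if (c < 0xF8) return 4;   /* 0xF6 falls here */
    if (c < 0xFC) return 5;
    if (c < 0xFE) return 6;
    return 7;
}

/* Returns the offset of the first lead byte whose sequence would run
 * past the end of the buffer, or len if there is none.  Before the
 * patch the decoder simply stopped at that offset; after the patch it
 * goes to the error handling instead. */
static size_t sketch_find_partial(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        size_t skip = sketch_utf8_skip(s[i]);
        if (skip > len - i)
            return i;         /* partial character */
        i += skip;
    }
    return len;
}
```

With this sketch, 0xF6 announces a 4-byte sequence, but "Bj\xF6rn", "Bj\xF6r" and "Bj\xF6" leave only 3, 2 and 1 bytes from that point on, so each one is reported as partial at offset 2.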

I will $Encode::VERSION++ as soon as I am done w/ the test suites and Tel's patch. This time I will be careful not to screw up (maint|blead)perl, so give me some time before the update is ready (but I won't keep you waiting for too long, since the 5.8.6 deadline is soon).

Your statement about \xF6\x80\x80\x80 is interesting, Encode::is_utf8 is documented as

[...]
  is_utf8(STRING [, CHECK])
    [INTERNAL] Tests whether the UTF-8 flag is turned on in the STRING.
    If CHECK is true, also checks the data in STRING for being
    well-formed UTF-8. Returns true if successful, false otherwise.
[...]

And D36 in Unicode 4.0.1 is very clear that

[...]
  As a consequence of the well-formedness conditions specified in Table
  3-6, the following byte values are disallowed in UTF-8: C0–C1, F5–FF.
[...]

That's because perl's notion of Unicode is broader than that of unicode.org. So far unicode.org's mapping only spans from U+0000 to U+10FFFF, while that of perl goes up to U+ffffFFFF or even U+ffffFFFFffffFFFF (in other words, MAX_UINT). See Camel 3 for details.
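As a rough illustration of the gap between the two notions, here is a small C sketch. The function names are made up for this mail and nothing here is taken from perl's source; it only encodes the D36 byte list quoted above and the usual 4-byte bit layout.

```c
/* True if the byte can never appear in well-formed UTF-8 according to
 * the D36 text quoted above: C0-C1 and F5-FF. */
static int sketch_is_disallowed_utf8_byte(unsigned char c)
{
    return c == 0xC0 || c == 0xC1 || c >= 0xF5;
}

/* Decodes a 4-byte sequence using the plain bit layout
 * 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, with no range checking,
 * which is roughly what a lax decoder would do. */
static unsigned long sketch_decode4(const unsigned char *s)
{
    return ((unsigned long)(s[0] & 0x07) << 18)
         | ((unsigned long)(s[1] & 0x3F) << 12)
         | ((unsigned long)(s[2] & 0x3F) << 6)
         |  (unsigned long)(s[3] & 0x3F);
}
```

Under these assumptions \xF6 is rejected by the strict byte test, while the lax decode of \xF6\x80\x80\x80 yields 0x180000, which is above U+10FFFF but still a perfectly ordinary number as far as an unsigned integer is concerned.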


And I think we can leave this :)

Dan the Encode Maintainer