perl-unicode

real UTF-8 vs. utf8n_to_uvuni()

2004-12-04 20:30:07
On Dec 05, 2004, at 10:56, Dan Kogai wrote:
Thanks, applied in my repository. New tests and documentation fix in progress. When I am done w/ that, I will release Encode-2.0901 on my web (not CPAN yet). When cross-checks by porters are done I will release Encode-2.10.

Dan the Encode Maintainer

I am now writing the test suite and have found that some of the strictures are missing.

Surrogate -- OK
% perl -Mblib -MEncode -le '$a="\x{d801}"; print encode("UTF-8", $a, 1)'
"\x{d801}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

U+FFFF -- OK
% perl -Mblib -MEncode -le '$a="\x{ffff}"; print encode("UTF-8", $a, 1)'
"\x{ffff}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

Chars above U+10FFFF -- NOT OK
% perl -Mblib -MEncode -le '$a="\x{11ffff}"; print encode("UTF-8", $a, 1)'
????
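
For reference, the three strictures exercised above can be written as one predicate on the code point. This is only a sketch: is_strict_utf8_codepoint() and UNICODE_MAX are hypothetical names that exist neither in Encode nor in the perl core, and the only noncharacter it rejects is U+FFFF, the one tested above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest code point Unicode defines; perl calls this PERL_UNICODE_MAX. */
#define UNICODE_MAX 0x10FFFFu

/* Hypothetical helper: true if cp may appear in strict UTF-8. */
static bool is_strict_utf8_codepoint(uint32_t cp)
{
    if (cp >= 0xD800u && cp <= 0xDFFFu)  /* surrogates, e.g. U+D801 */
        return false;
    if (cp == 0xFFFFu)                   /* the noncharacter tested above */
        return false;
    if (cp > UNICODE_MAX)                /* e.g. 0x11FFFF */
        return false;
    return true;
}
```

With such a predicate, all three test inputs above (0xD801, 0xFFFF, 0x11FFFF) would be rejected, while 0x10FFFF itself would still pass.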

Since Gisle's patch makes use of utf8n_to_uvuni(), this seems to be a problem in the perl core. So I checked utf8.c, which defines that function. It does not seem to make use of PERL_UNICODE_MAX.

The patch against utf8.c at the end of this message fixes that.

% ~/danperl/bin/perl5.8.6 -Mblib -MEncode -le '$a="\x{11FFFF}"; print encode("UTF-8", $a, 1)'
"\x{00f4}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

As you can see, the warning is still odd. In fact, the warning is odd in every case that involves UTF8_WARN_LONG, as follows:

% perl -Mblib -MEncode -le '$a="\x{7fff_ffff}"; print encode("UTF-8", $a, 1)'
??????
% perl -Mblib -MEncode -le '$a="\x{8000_0000}"; print encode("UTF-8", $a, 1)'
"\x{00fe}" does not map to utf8 at /gs1/dankogai/work/Encode/blib/lib/Encode.pm line 150.

I have tracked this down and found that the warning is generated by Encode, so Gisle and I can fix that.
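
One observation on the odd values: 0xF4 and 0xFE are the first bytes of the extended UTF-8 forms of 0x11FFFF and 0x8000_0000 respectively, so the message appears to name the leading byte rather than the code point. That is an inference from the arithmetic, not something confirmed in the Encode source. The sketch below computes that leading byte; extended_utf8_lead_byte() is a hypothetical helper, and it assumes the 7-byte form with a 0xFE lead that perl uses above 0x7FFF_FFFF.

```c
#include <stdint.h>

/* Hypothetical helper: first byte of the extended UTF-8 form of cp. */
static uint8_t extended_utf8_lead_byte(uint32_t cp)
{
    if (cp < 0x80u)       return (uint8_t)cp;                    /* 1 byte  */
    if (cp < 0x800u)      return (uint8_t)(0xC0u | (cp >> 6));   /* 2 bytes */
    if (cp < 0x10000u)    return (uint8_t)(0xE0u | (cp >> 12));  /* 3 bytes */
    if (cp < 0x200000u)   return (uint8_t)(0xF0u | (cp >> 18));  /* 4 bytes */
    if (cp < 0x4000000u)  return (uint8_t)(0xF8u | (cp >> 24));  /* 5 bytes */
    if (cp < 0x80000000u) return (uint8_t)(0xFCu | (cp >> 30));  /* 6 bytes */
    return 0xFEu;                                                /* 7 bytes */
}
```

The same table would also account for the unwarned outputs, assuming the terminal shows each raw byte as "?": 0x11FFFF takes 4 bytes and 0x7FFF_FFFF takes 6.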

Dan the Encode Maintainer

--- perl-5.8.x/utf8.c   Wed Nov 17 23:11:04 2004
+++ perl-5.8.x.dan/utf8.c       Sun Dec  5 11:38:52 2004
@@ -429,6 +429,13 @@
        }
        else
            uv = UTF8_ACCUMULATE(uv, *s);
+       /* Checks if ord() > 0x10FFFF -- dankogai */
+       if (uv > PERL_UNICODE_MAX){
+           if (!(flags & UTF8_ALLOW_LONG)) {
+               warning = UTF8_WARN_LONG;
+               goto malformed;
+           }
+       }
        if (!(uv > ouv)) {
            /* These cannot be allowed. */
            if (uv == ouv) {
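
Outside of utf8.c, the effect of the added check can be illustrated with a small standalone decoder that tests the accumulated value against the maximum after every continuation byte. This is a sketch under stated assumptions, not the perl implementation: decode_strict() and UNICODE_MAX are hypothetical names, only sequences of up to 4 bytes are accepted, and the overlong and surrogate checks are left out to keep the range check visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNICODE_MAX 0x10FFFFu  /* same value as perl's PERL_UNICODE_MAX */

/* Hypothetical decoder: returns 0 and stores the code point in *out,
 * or returns -1 if the sequence is malformed or exceeds UNICODE_MAX. */
static int decode_strict(const uint8_t *s, size_t len, uint32_t *out)
{
    uint32_t uv;
    size_t need, i;

    assert(s != NULL && out != NULL);
    if (len == 0)
        return -1;

    /* Classify the lead byte and keep its payload bits. */
    if (s[0] < 0x80)                { uv = s[0];        need = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { uv = s[0] & 0x1F; need = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { uv = s[0] & 0x0F; need = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { uv = s[0] & 0x07; need = 3; }
    else
        return -1;                  /* stray continuation or longer form */

    if (len < need + 1)
        return -1;                  /* truncated sequence */

    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;              /* not a continuation byte */
        uv = (uv << 6) | (s[i] & 0x3F);
        if (uv > UNICODE_MAX)       /* the check the patch adds */
            return -1;
    }
    *out = uv;
    return 0;
}
```

Given the bytes F4 9F BF BF (the 4-byte form of 0x11FFFF), this decoder fails on the last continuation byte, while F4 8F BF BF (U+10FFFF) decodes normally.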