perl-unicode

Re: real UTF-8 vs. utf8n_to_uvuni()

2004-12-06 04:30:04
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:

Since Gisle's patch makes use of utf8n_to_uvuni(), this seems to be a
problem in the perl core.  So I checked utf8.c, which defines it.
It seems that it does not make use of PERL_UNICODE_MAX.

The patch against utf8.c fixes that.

Seems like a good idea to have a workaround in Encode for this as
well.

Index: users/gisle/hacks/Encode/Encode.xs
--- Encode/Encode.xs.~1~        Mon Dec  6 10:44:31 2004
+++ Encode/Encode.xs    Mon Dec  6 10:44:31 2004
@@ -300,6 +300,10 @@
                                 UTF8_CHECK_ONLY | (strict ? UTF8_ALLOW_STRICT :
                                                             UTF8_ALLOW_NONSTRICT)
                                );
+#if 1 /* perl-5.8.6 and older do not check UTF8_ALLOW_LONG */
+           if (strict && uv > PERL_UNICODE_MAX)
+               ulen = -1;
+#endif
             if (ulen == -1) {
                 if (strict) {
                     uv = utf8n_to_uvuni(s, e - s, &ulen,
End of Patch.
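
For readers without the perl source at hand, the caller-side check in the
Encode.xs patch above can be sketched in standalone C.  This is only an
illustration under stated assumptions: sketch_decode() is a hypothetical,
simplified stand-in for a lenient decoder (it is not utf8n_to_uvuni() and
does not reject overlong forms or surrogates), SKETCH_UNICODE_MAX mirrors
the 0x10FFFF limit mentioned in the utf8.c patch comment, and all names
are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match the 0x10FFFF limit referred to as PERL_UNICODE_MAX. */
#define SKETCH_UNICODE_MAX 0x10FFFFUL

/*
 * Hypothetical lenient decoder: reads one sequence of 1 to 4 bytes and
 * returns its length, or -1 if the bytes are malformed.  Like the
 * behaviour discussed in this thread, it does NOT check the upper bound.
 */
static long sketch_decode(const unsigned char *s, size_t avail, uint32_t *out)
{
    size_t need, i;
    uint32_t uv;

    assert(s != NULL && out != NULL);
    if (avail == 0)
        return -1;

    if (s[0] < 0x80) { *out = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; uv = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; uv = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; uv = s[0] & 0x07; }
    else return -1;              /* stray continuation or invalid lead byte */

    if (avail < need)
        return -1;               /* truncated sequence */

    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;           /* not a continuation byte */
        uv = (uv << 6) | (uint32_t)(s[i] & 0x3F);   /* accumulate 6 bits */
    }
    *out = uv;
    return (long)need;
}

/*
 * Caller-side workaround in the spirit of the Encode.xs patch: in strict
 * mode, treat any decoded value above the Unicode maximum as malformed.
 */
static long sketch_decode_checked(const unsigned char *s, size_t avail,
                                  int strict, uint32_t *out)
{
    long len = sketch_decode(s, avail, out);

    if (len != -1 && strict && *out > SKETCH_UNICODE_MAX)
        len = -1;
    return len;
}
```

With this sketch, the bytes F4 8F BF BF (U+10FFFF) are accepted in both
modes, while F4 90 80 80 (which accumulates to 0x110000) is accepted only
when strict is 0.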


--- perl-5.8.x/utf8.c   Wed Nov 17 23:11:04 2004
+++ perl-5.8.x.dan/utf8.c       Sun Dec  5 11:38:52 2004
@@ -429,6 +429,13 @@
         }
         else
             uv = UTF8_ACCUMULATE(uv, *s);
+       /* Checks if ord() > 0x10FFFF -- dankogai */
+       if (uv > PERL_UNICODE_MAX){
+           if (!(flags & UTF8_ALLOW_LONG)) {
+               warning = UTF8_WARN_LONG;
+               goto malformed;
+           }
+       }
         if (!(uv > ouv)) {
             /* These cannot be allowed. */
             if (uv == ouv) {