Re: UTF8_ALLOW_ANYUV should not allow overlong sequences [PATCH]

Gisle Aas wrote:

Perl uses the UTF8_ALLOW_ANYUV mask in functions that should not be
restricted to only the valid Unicode code points.  For some reason
this mask currently includes the UTF8_ALLOW_LONG flag.  This seems
totally wrong, as there can't be a good reason to allow overlong
sequences just because we don't want to restrict the valid values.

Perl's ord() function is, for instance, perfectly happy with an overlong NUL:

$ perl -MEncode -wle '$a = "\xe0\x80\x80";Encode::_utf8_on($a);print ord($a)'
0
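
For readers unfamiliar with the term: a sequence is "overlong" when it
uses more bytes than the code point needs, so "\xe0\x80\x80" is a 3-byte
spelling of U+0000, which needs only 1 byte. The following is a minimal
sketch of that check in C. It is not Perl's decoder; the names
sketch_min_cp_for_len and sketch_decode_utf8 are hypothetical, and the
sketch only handles sequences of 1 to 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest code point that legitimately needs a sequence of `len` bytes. */
static uint32_t sketch_min_cp_for_len(size_t len)
{
    switch (len) {
    case 2:  return 0x80;
    case 3:  return 0x800;
    case 4:  return 0x10000;
    default: return 0x0;    /* 1-byte sequences can encode 0 */
    }
}

/* Hypothetical helper: decodes one 1-4 byte sequence from s (n bytes
 * available). Returns 0 if well formed, 1 if well formed but overlong,
 * -1 if malformed. On 0 or 1, *cp and *len_out are filled in. */
static int sketch_decode_utf8(const unsigned char *s, size_t n,
                              uint32_t *cp, size_t *len_out)
{
    size_t len, i;
    uint32_t v;

    assert(s != NULL && cp != NULL && len_out != NULL);
    if (n == 0)
        return -1;

    /* The start byte determines the expected length and payload bits. */
    if (s[0] < 0x80)                { len = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else
        return -1;
    if (n < len)
        return -1;

    /* Each continuation byte must look like 10xxxxxx and adds 6 bits. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return -1;
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *cp = v;
    *len_out = len;
    /* Overlong: the value would have fit in a shorter sequence. */
    return v < sketch_min_cp_for_len(len) ? 1 : 0;
}
```

Called on the three bytes 0xE0 0x80 0x80, this sketch decodes the value 0
and returns 1 (overlong), which corresponds to the case ord() accepts
silently in the example above.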

This patch fixes this problem:


Thanks, applied as change #23632 to bleadperl (although I'm not sure I
fully understand all the implications.)

--- utf8.h.cur        2004-12-06 11:16:52.176181667 +0100
+++ utf8.h    2004-12-06 11:17:16.672129909 +0100
@@ -183,8 +183,7 @@
 #define UTF8_ALLOW_FFFF                      0x0040 /* Allows also FFFE. */
 #define UTF8_ALLOW_LONG                      0x0080
 #define UTF8_ALLOW_ANYUV             (UTF8_ALLOW_EMPTY|UTF8_ALLOW_FE_FF|\
-                                      UTF8_ALLOW_SURROGATE|\
-                                      UTF8_ALLOW_FFFF|UTF8_ALLOW_LONG)
+                                      UTF8_ALLOW_SURROGATE|UTF8_ALLOW_FFFF)
 #define UTF8_ALLOW_ANY                       0x00FF
 #define UTF8_CHECK_ONLY                      0x0200
 


With this patch the example above outputs:

$ perl -MEncode -wle '$a = "\xe0\x80\x80";Encode::_utf8_on($a);print ord($a)'
Malformed UTF-8 character (3 bytes, need 1, after start byte 0xe0) in ord at -e line 1.
0
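
The change in behaviour follows from a single bit in the mask. The
sketch below rebuilds the old and new UTF8_ALLOW_ANYUV values under
DEMO_ names to show that the patch clears only the UTF8_ALLOW_LONG bit.
The values 0x0040 and 0x0080 come from the diff; the values used for
EMPTY, FE_FF and SURROGATE do not appear in the diff and are assumptions
made for illustration, as are all the DEMO_ and demo_ names.

```c
#include <assert.h>

enum {
    DEMO_ALLOW_EMPTY     = 0x0001, /* assumed value, not shown in the diff */
    DEMO_ALLOW_FE_FF     = 0x0008, /* assumed value, not shown in the diff */
    DEMO_ALLOW_SURROGATE = 0x0020, /* assumed value, not shown in the diff */
    DEMO_ALLOW_FFFF      = 0x0040, /* from the diff */
    DEMO_ALLOW_LONG      = 0x0080, /* from the diff */

    /* Mask as defined before the patch. */
    DEMO_ANYUV_OLD = DEMO_ALLOW_EMPTY | DEMO_ALLOW_FE_FF |
                     DEMO_ALLOW_SURROGATE | DEMO_ALLOW_FFFF |
                     DEMO_ALLOW_LONG,

    /* Mask as defined after the patch. */
    DEMO_ANYUV_NEW = DEMO_ALLOW_EMPTY | DEMO_ALLOW_FE_FF |
                     DEMO_ALLOW_SURROGATE | DEMO_ALLOW_FFFF
};

/* Returns 1 if the given flags would let an overlong sequence through. */
static int demo_permits_overlong(unsigned flags)
{
    return (flags & DEMO_ALLOW_LONG) != 0;
}

/* Sanity check: the patch removes exactly one bit from the mask. */
static void demo_check_patch_effect(void)
{
    assert((DEMO_ANYUV_OLD & ~DEMO_ANYUV_NEW) == DEMO_ALLOW_LONG);
}
```

With these definitions, demo_permits_overlong(DEMO_ANYUV_OLD) is 1 and
demo_permits_overlong(DEMO_ANYUV_NEW) is 0, while the other four bits
are unchanged.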


Could you turn this into a regression test?

-- 
You probably wouldn't have expected a communist to have a dog named Harpo.
    -- Malcolm Lowry, Under the Volcano
