Perl uses the UTF8_ALLOW_ANYUV mask in functions that should not be
restricted to only the valid Unicode code points. For some reason
this mask currently includes the UTF8_ALLOW_LONG flag. This seems
wrong: wanting to accept any code point value is no reason to also
accept overlong sequences.
Perl's ord() function, for instance, is perfectly happy with an overlong NUL:
$ perl -MEncode -wle '$a = "\xe0\x80\x80";Encode::_utf8_on($a);print ord($a)'
0
This patch fixes this problem:
--- utf8.h.cur 2004-12-06 11:16:52.176181667 +0100
+++ utf8.h 2004-12-06 11:17:16.672129909 +0100
@@ -183,8 +183,7 @@
 #define UTF8_ALLOW_FFFF 0x0040 /* Allows also FFFE. */
 #define UTF8_ALLOW_LONG 0x0080
 #define UTF8_ALLOW_ANYUV (UTF8_ALLOW_EMPTY|UTF8_ALLOW_FE_FF|\
- UTF8_ALLOW_SURROGATE|\
- UTF8_ALLOW_FFFF|UTF8_ALLOW_LONG)
+ UTF8_ALLOW_SURROGATE|UTF8_ALLOW_FFFF)
 #define UTF8_ALLOW_ANY 0x00FF
 #define UTF8_CHECK_ONLY 0x0200
With this patch the example above outputs:
$ perl -MEncode -wle '$a = "\xe0\x80\x80";Encode::_utf8_on($a);print ord($a)'
Malformed UTF-8 character (3 bytes, need 1, after start byte 0xe0) in ord at -e line 1.
0
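For anyone wondering why "\xe0\x80\x80" counts as overlong, here is a
minimal standalone sketch in C. It is not the code in utf8.c; the names
utf8_min_len and decode_one are hypothetical, and it only handles the
classic 1- to 4-byte forms. It decodes one sequence and compares the
number of bytes used against the shortest encoding of the resulting
value, which is roughly the comparison behind the "3 bytes, need 1"
part of the warning above.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Perl: number of bytes the shortest
 * UTF-8 encoding of cp needs (1- to 4-byte forms only). */
static size_t utf8_min_len(unsigned long cp)
{
    if (cp < 0x80UL)    return 1;
    if (cp < 0x800UL)   return 2;
    if (cp < 0x10000UL) return 3;
    return 4;
}

/* Hypothetical decoder: reads one sequence from s (len bytes available).
 * Returns 0 on success and -1 if the bytes are malformed or truncated.
 * On success *cp holds the decoded value and *overlong is 1 when more
 * bytes were used than the shortest encoding requires. */
static int decode_one(const unsigned char *s, size_t len,
                      unsigned long *cp, int *overlong)
{
    size_t need, i;
    unsigned long v;

    if (len == 0) return -1;

    /* The start byte determines how many bytes the sequence claims. */
    if (s[0] < 0x80)                { need = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; v = s[0] & 0x07; }
    else return -1;

    if (len < need) return -1;

    /* Each continuation byte must look like 10xxxxxx and adds 6 bits. */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1;
        v = (v << 6) | (unsigned long)(s[i] & 0x3F);
    }

    *cp = v;
    *overlong = (need > utf8_min_len(v));
    return 0;
}
```

Fed the three bytes 0xE0 0x80 0x80, decode_one reports the value 0 with
the overlong flag set, since NUL needs only one byte. A decoder that is
not told to allow overlong sequences should reject that input rather
than quietly return 0.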
Regards,
Gisle