On May 21, 2005 at 12:17, Jeff Breidenbach wrote:
I'm seeing a lot of UTF-8 warnings with 2.6.11.
Is this expected?
perl v5.8.4, mhonarc 2.6.11
Malformed UTF-8 character (1 byte, need 3, after start byte 0xef) in
unpack at /usr/share/mhonarc/MHonArc/CharEnt.pm line 156.
Can you provide me with a sample message?
I did some looking into it when I responded to your first message,
and concluded that the warnings may be justified (albeit annoying).
However, to be thorough, I'll take a look at what you have.
From a coding perspective, a key item to examine is the UTF-8 regex
I use:
# Regex pattern for UTF-8 data
my $utf8_re = q/([\x00-\x7F]|
[\xC0-\xDF][\x80-\xBF]|
\xE0 [\xA0-\xBF][\x80-\xBF]|
[\xE1-\xEF][\x80-\xBF]{2}|
\xF0 [\x90-\xBF][\x80-\xBF]{2}|
[\xF1-\xF7][\x80-\xBF]{3}|
\xF8 [\x88-\xBF][\x80-\xBF]{3}|
[\xF9-\xFB][\x80-\xBF]{4}|
\xFC [\x84-\xBF][\x80-\xBF]{4}|
\xFD [\x80-\xBF]{5}|
.)/;
As you can see, '.' is included at the end. Therefore, the regex
will capture invalid utf-8 octets. In the warning message you
provided, it appears that the octet 0xEF occurs in the input, but it
is not followed by 2 octets in the range 0x80 to 0xBF (which defines
a valid utf-8 sequence).
Therefore, the 0xEF is captured by the '.' portion of the regex by
itself, and when passed into unpack("U0U*",...), perl generates
a warning.
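
To make the capture behavior concrete, here is a rough sketch. It is not the code in CharEnt.pm: the pattern is cut down to the 1- to 3-octet forms, and the helper name split_octets and the sample input are made up for illustration. It only shows how the chunks fall out, and leaves the unpack call out entirely.

```perl
use strict;
use warnings;

# Simplified version of the pattern (1- to 3-octet forms only), keeping
# the catch-all '.' as the last alternative.  /s lets '.' match any octet.
my $utf8_re = qr/([\x00-\x7F]|
                  [\xC0-\xDF][\x80-\xBF]|
                  \xE0[\xA0-\xBF][\x80-\xBF]|
                  [\xE1-\xEF][\x80-\xBF]{2}|
                  .)/xs;

# Hypothetical helper: split an octet string into the chunks the regex
# captures, in order.
sub split_octets {
    my ($octets) = @_;
    my @chunks;
    while ($octets =~ /$utf8_re/g) {
        push @chunks, $1;
    }
    return @chunks;
}

# Hypothetical input: "a", a stray 0xEF, then "b".
my @chunks = split_octets("a\xEF" . "b");
print join(' ', map { unpack('H*', $_) } @chunks), "\n";   # prints "61 ef 62"
```

The stray 0xEF comes out as a chunk of its own, and that lone octet is what would then be handed to unpack.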
One thing to consider is if having the '.' is warranted. If left
out, invalid utf-8 octets will be quietly ignored.
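
For comparison, a sketch of what "quietly ignored" would look like, again using a cut-down pattern and a made-up helper name (valid_chunks) rather than the real code. Without the '.' alternative, a global match simply skips over any octet that does not start a valid sequence.

```perl
use strict;
use warnings;

# Same simplified pattern as a stand-in, but without the trailing '.'.
my $strict_re = qr/([\x00-\x7F]|
                    [\xC0-\xDF][\x80-\xBF]|
                    \xE0[\xA0-\xBF][\x80-\xBF]|
                    [\xE1-\xEF][\x80-\xBF]{2})/x;

# Hypothetical helper: collect only the chunks that match a valid form.
sub valid_chunks {
    my ($octets) = @_;
    my @chunks;
    while ($octets =~ /$strict_re/g) {
        push @chunks, $1;
    }
    return @chunks;
}

# The stray 0xEF matches nothing, so it silently disappears.
my @chunks = valid_chunks("a\xEF" . "b");
print join(' ', map { unpack('H*', $_) } @chunks), "\n";   # prints "61 62"
```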
Alternatively, I could modify the code to check for the invalid
sequence myself instead of blindly passing it to unpack. I.e., if the
regex matches a single octet, check its value. If it is an 8-bit value,
quietly replace the octet with the Unicode replacement character
entity reference (&#xFFFD;).
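
A minimal sketch of that check, under some assumptions: the function name replace_invalid_octets is hypothetical, and valid sequences are passed through unchanged here, whereas the real code would go on to convert them. It is only meant to show the single-octet test.

```perl
use strict;
use warnings;

# The pattern from above, compiled with /x (whitespace ignored) and /s
# (so the final '.' matches any octet).
my $utf8_re = qr/([\x00-\x7F]|
                  [\xC0-\xDF][\x80-\xBF]|
                  \xE0 [\xA0-\xBF][\x80-\xBF]|
                  [\xE1-\xEF][\x80-\xBF]{2}|
                  \xF0 [\x90-\xBF][\x80-\xBF]{2}|
                  [\xF1-\xF7][\x80-\xBF]{3}|
                  \xF8 [\x88-\xBF][\x80-\xBF]{3}|
                  [\xF9-\xFB][\x80-\xBF]{4}|
                  \xFC [\x84-\xBF][\x80-\xBF]{4}|
                  \xFD [\x80-\xBF]{5}|
                  .)/xs;

# Entity reference for U+FFFD, the Unicode replacement character.
my $replacement = '&#xFFFD;';

# Hypothetical helper: a single-octet match with the high bit set can
# only have come from the '.' alternative, so replace it; keep the rest.
sub replace_invalid_octets {
    my ($octets) = @_;
    $octets =~ s{$utf8_re}{
        (length($1) == 1 && ord($1) > 0x7F) ? $replacement : $1
    }ge;
    return $octets;
}

print replace_invalid_octets("a\xEF" . "b"), "\n";   # prints "a&#xFFFD;b"
```

With this in place the stray octet never reaches unpack, so the warning would not be triggered for it.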
--ewh