Re: lots of UTF-8 warnings

2005-05-21 13:53:34
On May 21, 2005 at 12:17, Jeff Breidenbach wrote:

I'm seeing a lot of UTF-8 warnings with 2.6.11.

Is this expected?

perl v5.8.4, mhonarc 2.6.11

Malformed UTF-8 character (1 byte, need 3, after start byte 0xef) in
unpack at /usr/share/mhonarc/MHonArc/ line 156.

Can you provide me with a sample message?

I did some looking into it when I responded to your first message,
and concluded that the warnings may be justified (albeit annoying).

However, to be thorough, I'll take a look at what you have.

From a coding perspective, a key item to examine is the UTF-8 regex
I use:

# Regex pattern for UTF-8 data
my $utf8_re = q/([\x00-\x7F]|
                  \xE0      [\xA0-\xBF][\x80-\xBF]|
                  \xF0      [\x90-\xBF][\x80-\xBF]{2}|
                  \xF8      [\x88-\xBF][\x80-\xBF]{3}|
                  \xFC      [\x84-\xBF][\x80-\xBF]{4}|
                  \xFD      [\x80-\xBF]{5}|

As you can see, '.' is included at the end.  Therefore, the regex
will capture invalid utf-8 octets.  In the warning message you
provided, it appears that the octet 0xEF occurs in the input, but it
is not followed by 2 octets in the range 0x80 to 0xBF (which defines
a valid utf-8 sequence).

Therefore, the 0xEF is captured by the '.' portion of the regex by
itself, and when passed into unpack("U0U*",...), perl generates
a warning.
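
To make the fall-through concrete, here is a small stand-alone sketch.
The pattern in it is a simplified stand-in that I am assuming for
illustration (1- to 3-octet sequences plus the '.' catch-all), not the
full MHonArc regex, and utf8_tokens() is a hypothetical helper name.

```perl
use strict;
use warnings;

# Simplified stand-in pattern (an assumption, not MHonArc's regex):
# 1- to 3-octet UTF-8 sequences, then '.' as a catch-all.
my $re = qr/([\x00-\x7F]|
             [\xC2-\xDF][\x80-\xBF]|
             \xE0[\xA0-\xBF][\x80-\xBF]|
             [\xE1-\xEF][\x80-\xBF]{2}|
             .)/xs;

# Hypothetical helper: split a byte string into the pieces the
# pattern captures, in order.
sub utf8_tokens {
    my ($bytes) = @_;
    my @tokens;
    while ($bytes =~ /$re/g) {
        push @tokens, $1;
    }
    return @tokens;
}

# 0xEF followed by two continuation octets: one 3-octet token.
my @ok = utf8_tokens("\xEF\xBB\xBF");

# 0xEF followed by ASCII: no multi-octet branch matches, so the
# '.' branch captures the 0xEF on its own.
my @bad = utf8_tokens("A\xEF" . "B");

print join(' ', map { unpack('H*', $_) } @ok),  "\n";   # prints: efbbbf
print join(' ', map { unpack('H*', $_) } @bad), "\n";   # prints: 41 ef 42
```

The lone "ef" token in the second line of output is the kind of value
that ends up being handed to unpack by itself.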

One thing to consider is whether having the '.' is warranted.  If left
out, invalid UTF-8 octets will be quietly ignored.

Alternatively, I could modify the code to check for the invalid
sequence myself instead of blindly passing it to unpack.  I.e., if the
regex matches a single octet, check its value.  If it is an 8-bit value,
quietly replace the octet with the invalid Unicode character entity reference.
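
A rough sketch of that check might look like the following.  Everything
in it is assumed for illustration: the pattern is a simplified stand-in
(1- to 3-octet sequences plus '.'), utf8_to_entities() and decode_seq()
are hypothetical names, and using &#xFFFD; (U+FFFD REPLACEMENT
CHARACTER) as the marker is my guess at a reasonable choice.  It also
decodes valid sequences with plain bit arithmetic rather than
unpack("U0U*",...), just to keep the example independent of unpack.

```perl
use strict;
use warnings;

# Simplified stand-in pattern (an assumption, not MHonArc's regex).
my $re = qr/([\x00-\x7F]|
             [\xC2-\xDF][\x80-\xBF]|
             \xE0[\xA0-\xBF][\x80-\xBF]|
             [\xE1-\xEF][\x80-\xBF]{2}|
             .)/xs;

# Assumed marker for invalid octets: U+FFFD as an entity reference.
my $INVALID = '&#xFFFD;';

# Hypothetical helper: decode one valid 2- or 3-octet sequence into
# its code point using bit arithmetic.
sub decode_seq {
    my @o  = unpack('C*', $_[0]);
    my $n  = scalar @o;
    my $cp = $o[0] & (0xFF >> ($n + 1));      # drop the length-prefix bits
    $cp = ($cp << 6) | ($_ & 0x3F) for @o[1 .. $#o];
    return $cp;
}

# Hypothetical helper: convert UTF-8 octets to numeric entity
# references, checking single-octet matches before decoding.
sub utf8_to_entities {
    my ($bytes) = @_;
    $bytes =~ s{$re}{
        my $seq = $1;
        length($seq) > 1 ? sprintf('&#x%X;', decode_seq($seq))  # valid sequence
      : ord($seq) > 0x7F ? $INVALID                             # stray 8-bit octet
      :                    $seq                                 # plain ASCII
    }ge;
    return $bytes;
}

print utf8_to_entities("A\xEF" . "B"), "\n";    # prints: A&#xFFFD;B
print utf8_to_entities("\xE2\x82\xAC"), "\n";   # prints: &#x20AC;
```

With this approach the stray 0xEF never reaches the decoding step, so
no warning is triggered for it, and the output still shows that
something invalid was present in the input.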

