Re: lots of UTF-8 warnings

On May 21, 2005 at 12:17, Jeff Breidenbach wrote:

I'm seeing a lot of UTF-8 warnings with 2.6.11.

Is this expected?

perl v5.8.4, mhonarc 2.6.11

Malformed UTF-8 character (1 byte, need 3, after start byte 0xef) in
unpack at /usr/share/mhonarc/MHonArc/CharEnt.pm line 156.


Can you provide me with a sample message?

I did some looking into it when I responded to your first message,
and concluded that the warnings may be justified (albeit annoying).

However, to be thorough, I'll take a look at what you have.

From a coding perspective, a key item to examine is the UTF-8 regex
I use:

# Regex pattern for UTF-8 data
my $utf8_re = q/([\x00-\x7F]|
                 [\xC0-\xDF][\x80-\xBF]|
                  \xE0      [\xA0-\xBF][\x80-\xBF]|
                 [\xE1-\xEF][\x80-\xBF]{2}|
                  \xF0      [\x90-\xBF][\x80-\xBF]{2}|
                 [\xF1-\xF7][\x80-\xBF]{3}|
                  \xF8      [\x88-\xBF][\x80-\xBF]{3}|
                 [\xF9-\xFB][\x80-\xBF]{4}|
                  \xFC      [\x84-\xBF][\x80-\xBF]{4}|
                  \xFD      [\x80-\xBF]{5}|
                 .)/;

As you can see, '.' is included at the end.  Therefore, the regex
will capture invalid utf-8 octets.  In the warning message you
provided, it appears that the octet 0xEF occurs in the input, but it
is not followed by 2 octets in the range 0x80 to 0xBF (which defines
a valid utf-8 sequence).

Therefore, the 0xEF is captured by the '.' portion of the regex by
itself, and when passed into unpack("U0U*",...), perl generates
a warning.
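
To make the fallback behavior concrete, here is a minimal sketch that
splits some octets into the sequences the pattern captures.  It is not
MHonArc's actual code: the helper name utf8_tokens is made up for
illustration, and the pattern is assumed to be compiled with the /x
modifier (so the layout whitespace is ignored) plus /s.

```perl
use strict;
use warnings;

# Same alternatives as the pattern above.  Assumption: it is applied
# with /x (layout whitespace ignored); /s lets '.' match a newline too.
my $utf8_re = qr/([\x00-\x7F]|
                  [\xC0-\xDF][\x80-\xBF]|
                   \xE0      [\xA0-\xBF][\x80-\xBF]|
                  [\xE1-\xEF][\x80-\xBF]{2}|
                   \xF0      [\x90-\xBF][\x80-\xBF]{2}|
                  [\xF1-\xF7][\x80-\xBF]{3}|
                   \xF8      [\x88-\xBF][\x80-\xBF]{3}|
                  [\xF9-\xFB][\x80-\xBF]{4}|
                   \xFC      [\x84-\xBF][\x80-\xBF]{4}|
                   \xFD      [\x80-\xBF]{5}|
                  .)/xs;

# Hypothetical helper: return each captured sequence rendered as hex.
sub utf8_tokens {
    my ($octets) = @_;
    my @tokens;
    while ($octets =~ /$utf8_re/g) {
        push @tokens, join(' ', map { sprintf('%02X', $_) } unpack('C*', $1));
    }
    return @tokens;
}

# 'A', a lone 0xEF, 'B', then a well-formed three-octet sequence.
print join(' | ', utf8_tokens("A\xEF" . "B\xE2\x82\xAC")), "\n";
# prints: 41 | EF | 42 | E2 82 AC
```

The lone 0xEF comes out as its own token because 'B' is not in the
range 0x80 to 0xBF, so only the '.' alternative can match it.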

One thing to consider is whether having the '.' is warranted.  If left
out, invalid utf-8 octets will be quietly ignored.

Alternatively, I could try modifying the code to check for the invalid
sequence myself instead of blindly passing it to unpack.  I.e., if the
regex matches a single octet, check its value.  If it is an 8-bit value,
quietly replace the octet with the character entity reference for the
Unicode replacement character (&#xFFFD;).
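
A rough sketch of that check follows.  It is not the code in
CharEnt.pm: the helper name octets_to_entities is hypothetical, the
pattern is assumed to be compiled with /x, and multi-octet sequences
are decoded by hand here rather than through unpack, so no warning
can be raised.  Escaping of characters such as '&' and '<' is left
out.

```perl
use strict;
use warnings;

# The pattern quoted earlier.  Assumption: it is applied with /x
# (layout whitespace ignored); /s lets '.' match a newline too.
my $utf8_re = qr/([\x00-\x7F]|
                  [\xC0-\xDF][\x80-\xBF]|
                   \xE0      [\xA0-\xBF][\x80-\xBF]|
                  [\xE1-\xEF][\x80-\xBF]{2}|
                   \xF0      [\x90-\xBF][\x80-\xBF]{2}|
                  [\xF1-\xF7][\x80-\xBF]{3}|
                   \xF8      [\x88-\xBF][\x80-\xBF]{3}|
                  [\xF9-\xFB][\x80-\xBF]{4}|
                   \xFC      [\x84-\xBF][\x80-\xBF]{4}|
                   \xFD      [\x80-\xBF]{5}|
                  .)/xs;

# Hypothetical helper: map raw octets to ASCII plus numeric character
# references, replacing stray 8-bit octets with &#xFFFD;.
sub octets_to_entities {
    my ($octets) = @_;
    my $out = '';
    while ($octets =~ /$utf8_re/g) {
        my @bytes = unpack('C*', $1);
        my $lead  = shift @bytes;
        if (!@bytes) {
            # Single octet: 7-bit passes through, a lone 8-bit octet
            # is invalid UTF-8 and is replaced.
            $out .= $lead < 0x80 ? chr($lead) : '&#xFFFD;';
            next;
        }
        # Multi-octet sequence: keep the payload bits of the lead
        # octet, then append six bits per continuation octet.
        my $cp = $lead & (0xFF >> (@bytes + 2));
        $cp = ($cp << 6) | ($_ & 0x3F) for @bytes;
        $out .= sprintf('&#x%X;', $cp);
    }
    return $out;
}

print octets_to_entities("A\xEF" . "B"), "\n";    # prints: A&#xFFFD;B
print octets_to_entities("\xE2\x82\xAC"), "\n";   # prints: &#x20AC;
```

With this arrangement the '.' alternative stays in the pattern, so
invalid octets are still seen, but they never reach unpack.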

--ewh
