Re: UTF-8 warnings in 2.6.11

On May 21, 2005 at 02:00, Jeff Breidenbach wrote:

> I'm seeing quite a few UTF-8 warnings on 2.6.11. Is this
> expected?


I believe so.  The fix for bug #11187 activates perl's built-in UTF-8
sequence checks.  It appears it is common for email tagged with utf-8
encoding to have invalid utf-8 sequences.

I get a lot of warnings from the utf-8 sample message I have, which
contains (deliberately) malformed utf-8 sequences.  Where the sequences
are good, no warnings are generated.

As noted in the bug's comments, I do not understand why I needed to
make the fix in the first place.  My guess is that unpack behaves
differently between perl versions.  According to the latest docs at
perldoc.perl.org, the lone 'U' template for unpack should always
work, <http://perldoc.perl.org/perluniintro.html>:

    For UTF-8 only, you can use:

        use warnings;
        @chars = unpack("U0U*", $string_of_bytes_that_I_think_is_utf8);

    If invalid, a Malformed UTF-8 character (byte 0x##) in unpack
    warning is produced. The "U0" means "expect strictly UTF-8 encoded
    Unicode". Without that the unpack("U*", ...) would accept also
    data like chr(0xFF), similarly to the pack as we saw earlier.

With that said, the fix is probably better since perl validates
the sequence internally and generates a warning if the sequence is bad.

Right now, I will not do any more research unless you (or someone else)
can provide example UTF-8 input that is valid but still triggers the
malformed-character warnings.  If you can isolate a message that does
generate the warnings, I can help you examine it to see if the warnings
are justified.
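
If it helps with isolating such a message, here is a rough sketch that
lists which lines of some raw text contain malformed UTF-8.  This is not
what MHonArc does internally: it uses the Encode module's strict decoder
instead of unpack, and the sub names is_valid_utf8 and bad_utf8_lines
are made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if $bytes is strictly valid UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # 'UTF-8' (with the hyphen) selects Encode's strict decoder.
    # FB_CROAK makes decode die on a malformed sequence, and
    # LEAVE_SRC stops it from modifying $bytes.
    return eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Hypothetical helper: 1-based numbers of the lines in $raw that
# contain at least one malformed sequence.
sub bad_utf8_lines {
    my ($raw) = @_;
    my @lines = split /\n/, $raw, -1;
    return grep { !is_valid_utf8($lines[$_ - 1]) } 1 .. @lines;
}
```

Read the text in with a ':raw' layer so perl does not touch the bytes,
and note that a base64 or quoted-printable body has to be decoded first,
since the check only makes sense on the actual utf-8 bytes.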

--ewh
