perl-unicode

Re: Warnings on illegal UTF8

1998-10-11 09:27:53
On 11 Oct 1998 12:32:37 +0200, Gisle Aas <gisle(_at_)aas(_dot_)no> said:

gisle> andreas(_dot_)koenig(_at_)dubravka(_dot_)in-berlin(_dot_)de (Andreas J. Koenig) writes:

gisle> One could suggest that we should make length() return undef for bad
gisle> UTF-8 strings (inside 'use utf8') and no warnings.

Hmmm.

gisle> Problem with this proposal is that it would make length() much slower.
gisle> Today it only looks at the first byte of each UTF-8 byte-sequence and
gisle> then skips all the 10xx xxxx bytes without looking at them.  Problem
gisle> with your string is that both '\xFC' and '\xDF' are legal UTF-8 start
gisle> bytes, and:

gisle>    utf8skip['\xFC'] is 6
gisle>    utf8skip['\xDF'] is 2

gisle> So when length() sees '\xFC' it just skips the next 5 bytes and when
gisle> it sees '\xDF' it skips the final 'e'.  This gives the result you got and
gisle> no warnings.

Thanks for the explanation, this makes it understandable why the bug
is there. It even makes the bug tolerable, in my view.
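
To make the skipping concrete, here is a minimal sketch in plain Perl of
the counting strategy described above. It is not the actual code inside
perl: utf8_skip() and fast_length() are hypothetical names, and the skip
values for bytes other than '\xFC' and '\xDF' are my assumption about
what the table holds.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the utf8skip[] table: how many bytes a
# sequence claims to occupy, judged from its first byte alone.
sub utf8_skip {
    my ($byte) = @_;
    return 1 if $byte < 0xC0;   # ASCII, or a stray continuation byte
    return 2 if $byte < 0xE0;   # includes 0xDF
    return 3 if $byte < 0xF0;
    return 4 if $byte < 0xF8;
    return 5 if $byte < 0xFC;
    return 6 if $byte < 0xFE;   # includes 0xFC
    return 1;                   # 0xFE, 0xFF: assumed to count as one byte
}

# Count "characters" by trusting each start byte and never inspecting
# the bytes that are skipped.
sub fast_length {
    my ($bytes) = @_;
    my ($pos, $count, $n) = (0, 0, length $bytes);
    while ($pos < $n) {
        $pos += utf8_skip(ord(substr($bytes, $pos, 1)));
        $count++;
    }
    return $count;
}

# The Latin-1 string from this thread: 14 bytes, none of it valid UTF-8
# after the first byte.
print fast_length("L\xFCbeckerstra\xDFe"), "\n";   # prints 8
```

The count comes out as 8 because '\xFC' swallows "becke" and '\xDF'
swallows the final 'e', exactly as described, and nothing along the way
has a chance to notice that the skipped bytes are not 10xx xxxx.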

gisle> I agree that there should be some simple way to determine if a
gisle> sequence is valid UTF-8.  Some new pragma to make length() more
gisle> careful?

But why length(), then? My reasoning when I tried the above code was
just: -w should give me a warning whenever I use a byte sequence that
is invalid UTF-8. When I put the sequence directly into the program,
then I get the warning I expected:

    % perl -Ilib -wle '
    use utf8;
    print "L\xFCbeckerstra\xDFe";  
    '
    Malformed UTF-8 character at -e line 3.
    Malformed UTF-8 character at -e line 3.
    L\xFCbeckerstra\xDFe

That's why I wanted to have a validator turned on by -w. Hmmm. Maybe
not. Something to chew on.
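
For illustration only, a validator of the kind discussed in this thread
could look roughly like the sketch below. It follows the suggestion of
returning undef for bad strings. seq_length() and careful_length() are
hypothetical names, not anything in perl, and the check is purely
structural: it accepts sequences of up to 6 bytes and does not reject
overlong encodings.

```perl
use strict;
use warnings;

# Hypothetical helper: sequence length implied by a start byte, or 0 if
# the byte cannot start a sequence at all.
sub seq_length {
    my ($lead) = @_;
    return 1 if $lead < 0x80;   # ASCII
    return 0 if $lead < 0xC0;   # 10xxxxxx cannot start a sequence
    return 2 if $lead < 0xE0;
    return 3 if $lead < 0xF0;
    return 4 if $lead < 0xF8;
    return 5 if $lead < 0xFC;
    return 6 if $lead < 0xFE;
    return 0;                   # 0xFE and 0xFF are never start bytes
}

# Return the character count, or undef if any sequence is truncated or
# has a trailing byte that is not of the form 10xxxxxx.
sub careful_length {
    my ($bytes) = @_;
    my ($pos, $count, $n) = (0, 0, length $bytes);
    while ($pos < $n) {
        my $len = seq_length(ord(substr($bytes, $pos, 1)));
        return undef if $len == 0 || $pos + $len > $n;
        for my $i (1 .. $len - 1) {
            my $byte = ord(substr($bytes, $pos + $i, 1));
            return undef if ($byte & 0xC0) != 0x80;
        }
        $pos += $len;
        $count++;
    }
    return $count;
}

print defined careful_length("L\xFCbeckerstra\xDFe") ? "valid" : "invalid", "\n";
print careful_length("L\xC3\xBCbeck"), "\n";
```

The price is the one Gisle names: every byte of the string is examined,
not just the start bytes.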

-- 
andreas