On 11 Oct 1998 12:32:37 +0200, Gisle Aas <gisle(_at_)aas(_dot_)no> said:
gisle> andreas(_dot_)koenig(_at_)dubravka(_dot_)in-berlin(_dot_)de (Andreas J. Koenig) writes:
gisle> One could suggest that we should make length() return undef for bad
gisle> UTF-8 strings (inside 'use utf8') and no warnings.
Hmmm.
gisle> Problem with this proposal is that it would make length() much slower.
gisle> Today it only looks at the first byte of each UTF-8 byte-sequence and
gisle> then skips all the 10xx xxxx bytes without looking at them. Problem
gisle> with your string is that both '\xFC' and '\xDF' are legal UTF-8 start
gisle> bytes, and:
gisle> utf8skip['\xFC'] is 6
gisle> utf8skip['\xDF'] is 2
gisle> So when length() sees '\xFC' it just skips the next 5 characters and when
gisle> it sees '\xDF' it skips the final 'e'. This gives the result you got and
gisle> no warnings.
Thanks for the explanation, this makes it understandable why the bug
is there. And it even makes it tolerable in my perception.
gisle> I agree that there should be some simple way to determine if a
gisle> sequence is valid UTF-8. Some new pragma to make length() more
gisle> careful?
But why length(), then? My reasoning when I tried the above code was
just: -w should give me a warning whenever I use a byte sequence that
is invalid utf8. When I put the sequence directly into the program,
then I get the warning I expected:
% perl -Ilib -wle '
use utf8;
print "L\xFCbeckerstra\xDFe";
'
Malformed UTF-8 character at -e line 3.
Malformed UTF-8 character at -e line 3.
L\xFCbeckerstra\xDFe
That's why I wanted to have a validator turned on by -w. Hmmm. Maybe
not. Something to chew on.
--
andreas