andreas(_dot_)koenig(_at_)dubravka(_dot_)in-berlin(_dot_)de (Andreas J. Koenig)
writes:
> With 5.005_52 plus Sarathy's must-apply patch, I get
>
>   % ./perl -Ilib -wle '
>      use utf8;
>      while (my $s = shift @ARGV){
>          print "s[$s]";
>          print length $s;
>      }
>   ' L\xFCbeckerstra\xDFe
>   s[L\xFCbeckerstra\xDFe]
>   8
>
> (In case some mail handling mechanism kills the 8th bit: my @ARGV is one
> Latin-1 word, namely "Luebeckerstrasse" spelt properly in German.)
>
> This looks like two bugs to me: no warning about bad UTF-8 and a wrong
> computation of the length of the string.
>
> In general I'd like to ask: what's considered the politically correct
> way to check if a string contains legal UTF-8?
>
> One could suggest that we should make length() return undef for bad
> UTF-8 strings (inside 'use utf8') and no warnings.
Problem with this proposal is that it would make length() much slower.
Today it only looks at the first byte of each UTF-8 byte-sequence and
then skips all the 10xx xxxx bytes without looking at them. Problem
with your string is that both '\xFC' and '\xDF' are legal UTF-8 start bytes,
and:
utf8skip['\xFC'] is 6
utf8skip['\xDF'] is 2
So when length() sees '\xFC' it just skips the next 5 bytes, and when
it sees '\xDF' it skips the final 'e'. That gives the result you got,
and no warnings.
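The miscount is easy to reproduce with the skip-table walk described
above. This is just a Python illustration of the C-level behaviour
(not Perl's actual code); the table values match the ones quoted,
i.e. '\xFC' -> 6 and '\xDF' -> 2:

```python
def utf8skip(lead):
    # Old-style six-byte UTF-8 skip table, keyed on the lead byte.
    if lead < 0xC0: return 1   # ASCII, or a stray continuation byte
    if lead < 0xE0: return 2
    if lead < 0xF0: return 3
    if lead < 0xF8: return 4
    if lead < 0xFC: return 5
    return 6

def naive_length(s):
    """Count characters by hopping utf8skip(lead) bytes at a time,
    never checking that the skipped bytes really are 10xxxxxx."""
    i = n = 0
    while i < len(s):
        i += utf8skip(s[i])
        n += 1
    return n

print(naive_length(b"L\xFCbeckerstra\xDFe"))  # prints 8, as reported
```

The '\xFC' hop swallows "becke" and the '\xDF' hop swallows the final
'e', so the 14 Latin-1 bytes come out as 8 "characters" with no error.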
I agree that there should be some simple way to determine if a
sequence is valid UTF-8. Some new pragma to make length() more
careful?
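For what such a careful length() would have to do: verify that every
sequence really has the continuation bytes its lead byte promises, and
bail out otherwise. Again a Python sketch of the idea (returning None,
think undef, on malformed input), not Perl source:

```python
def seq_len(lead):
    # Expected sequence length for a lead byte (old six-byte UTF-8);
    # 0 means the byte cannot start a sequence at all.
    if lead < 0x80: return 1
    if lead < 0xC0: return 0   # bare continuation byte
    if lead < 0xE0: return 2
    if lead < 0xF0: return 3
    if lead < 0xF8: return 4
    if lead < 0xFC: return 5
    if lead < 0xFE: return 6
    return 0                   # 0xFE/0xFF never appear in UTF-8

def careful_length(s):
    """Count characters, returning None as soon as a sequence is
    malformed: bad lead byte, truncated sequence, or a byte that is
    not 10xxxxxx where a continuation byte must be."""
    i = n = 0
    while i < len(s):
        k = seq_len(s[i])
        if k == 0 or i + k > len(s):
            return None
        if any(not 0x80 <= b < 0xC0 for b in s[i + 1:i + k]):
            return None
        i += k
        n += 1
    return n

print(careful_length(b"L\xFCbeckerstra\xDFe"))     # None: not UTF-8
print(careful_length("Lübeckerstraße".encode()))   # 14
```

The extra per-byte checks are exactly the cost the reply above is
worried about, which is why gating them behind a pragma (or a separate
validity-check function) rather than making length() always pay it
seems attractive.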
--
Gisle Aas