andreas(_dot_)koenig(_at_)dubravka(_dot_)in-berlin(_dot_)de (Andreas J. Koenig)
writes:
> With 5.005_52 plus Sarathy's must-apply patch, I get
>
>   % ./perl -Ilib -wle '
>      use utf8;
>      while (my $s = shift @ARGV){
>          print "s[$s]";
>          print length $s;
>      }
>   ' L\xFCbeckerstra\xDFe
>   s[L\xFCbeckerstra\xDFe]
>   8
>
> (In case some mail handling mechanism kills the 8th bit: my @ARGV is one
> Latin-1 word, namely "Luebeckerstrasse" spelt properly in German.)
>
> This looks like two bugs to me: no warning about bad UTF-8 and a wrong
> computation of the length of the string.
>
> In general I'd like to ask: what's considered the politically correct
> way to check if a string contains legal UTF-8?
>
> One could suggest that we should make length() return undef for bad
> UTF-8 strings (inside 'use utf8') and no warnings.
Problem with this proposal is that it would make length() much slower.
Today it only looks at the first byte of each UTF-8 byte-sequence and
then skips all the 10xx xxxx bytes without looking at them. Problem
with your string is that both '\xFC' and '\xDF' are legal UTF-8 start bytes,
and:
utf8skip['\xFC'] is 6
utf8skip['\xDF'] is 2
So when length() sees '\xFC' it just skips the next 5 bytes, and when
it sees '\xDF' it skips the final 'e'. That gives the result you got,
and no warnings.
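The miscount is easy to reproduce with the skip-table walk described
above. This is just a Python illustration of the C-level behaviour
(not Perl's actual code); the table values match the ones quoted,
i.e. '\xFC' -> 6 and '\xDF' -> 2:

```python
def utf8skip(lead):
    # Old-style six-byte UTF-8 skip table, keyed on the lead byte.
    if lead < 0xC0: return 1   # ASCII, or a stray continuation byte
    if lead < 0xE0: return 2
    if lead < 0xF0: return 3
    if lead < 0xF8: return 4
    if lead < 0xFC: return 5
    return 6

def naive_length(s):
    """Count characters by hopping utf8skip(lead) bytes at a time,
    never checking that the skipped bytes really are 10xxxxxx."""
    i = n = 0
    while i < len(s):
        i += utf8skip(s[i])
        n += 1
    return n

print(naive_length(b"L\xFCbeckerstra\xDFe"))  # prints 8, as reported
```

The '\xFC' hop swallows "becke" and the '\xDF' hop swallows the final
'e', so the 14 Latin-1 bytes come out as 8 "characters" with no error.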
I agree that there should be some simple way to determine if a
sequence is valid UTF-8. Some new pragma to make length() more
careful?
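For what such a careful length() would have to do: verify that every
sequence really has the continuation bytes its lead byte promises, and
bail out otherwise. Again a Python sketch of the idea (returning None,
think undef, on malformed input), not Perl source:

```python
def seq_len(lead):
    # Expected sequence length for a lead byte (old six-byte UTF-8);
    # 0 means the byte cannot start a sequence at all.
    if lead < 0x80: return 1
    if lead < 0xC0: return 0   # bare continuation byte
    if lead < 0xE0: return 2
    if lead < 0xF0: return 3
    if lead < 0xF8: return 4
    if lead < 0xFC: return 5
    if lead < 0xFE: return 6
    return 0                   # 0xFE/0xFF never appear in UTF-8

def careful_length(s):
    """Count characters, returning None as soon as a sequence is
    malformed: bad lead byte, truncated sequence, or a byte that is
    not 10xxxxxx where a continuation byte must be."""
    i = n = 0
    while i < len(s):
        k = seq_len(s[i])
        if k == 0 or i + k > len(s):
            return None
        if any(not 0x80 <= b < 0xC0 for b in s[i + 1:i + k]):
            return None
        i += k
        n += 1
    return n

print(careful_length(b"L\xFCbeckerstra\xDFe"))     # None: not UTF-8
print(careful_length("Lübeckerstraße".encode()))   # 14
```

The extra per-byte checks are exactly the cost the reply above is
worried about, which is why gating them behind a pragma (or a separate
validity-check function) rather than making length() always pay it
seems attractive.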
--
Gisle Aas