Re: Encode, take five (malformed UTF-8)

=head2 Handling Malformed Data


What exactly is malformed UTF-8 data here?

Obviously at least everything listed in section R.7 of ISO 10646-1/Amd.2.

Does it also cover overlong UTF-8 sequences, i.e. any string
containing any of the following five bit patterns?

  1100000x,
  11100000 100xxxxx,
  11110000 1000xxxx,
  11111000 10000xxx,
  11111100 100000xx
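
To make the five patterns concrete, here is a minimal Python sketch of a scanner for them. It is not what Encode or perl does; the function name contains_overlong and the table layout are made up for illustration. It relies on the fact that all the leading bytes involved (0xC0, 0xC1, 0xE0, 0xF0, 0xF8, 0xFC) can never be continuation bytes, so a plain byte-by-byte scan is enough.

```python
# Hypothetical helper, not part of any library: flags the five overlong
# bit patterns listed above wherever they occur in a byte string.

# (lead byte, mask for the following byte). The pattern matches when the
# following byte, under the mask, equals 0x80.
_OVERLONG_PREFIXES = (
    (0xE0, 0xE0),  # 11100000 100xxxxx
    (0xF0, 0xF0),  # 11110000 1000xxxx
    (0xF8, 0xF8),  # 11111000 10000xxx
    (0xFC, 0xFC),  # 11111100 100000xx
)


def contains_overlong(data: bytes) -> bool:
    """Return True if data contains any of the five overlong patterns."""
    for i, byte in enumerate(data):
        # 1100000x: a two-byte lead that can only encode U+0000..U+007F.
        if (byte & 0xFE) == 0xC0:
            return True
        if i + 1 < len(data):
            following = data[i + 1]
            for lead, mask in _OVERLONG_PREFIXES:
                if byte == lead and (following & mask) == 0x80:
                    return True
    return False
```

For example, contains_overlong(b"\xc0\xaf") is true (an overlong encoding of "/"), while contains_overlong(b"\xe0\xa0\x80"), the shortest-form encoding of U+0800, is false.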

Does it also cover UTF-8 encoded code positions U+D800 to U+DFFF (UTF-16
surrogates) as well as U+FFFE (anti-BOM) and U+FFFF, all of which must
not occur in proper UTF-8 and UTF-32 data according to the standard
(see note 3 in section R.4 of UCS)?
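
For reference, the byte sequences in question are easy to spot: U+D800 to U+DFFF would encode as 0xED followed by a byte in 0xA0..0xBF and one more continuation byte, U+FFFE as EF BF BE, and U+FFFF as EF BF BF. The Python sketch below reports the offsets where they occur. It only illustrates the check being asked about and says nothing about what Encode currently does; the name find_forbidden_positions is made up.

```python
# Hypothetical helper, not part of any library: reports byte offsets where
# a UTF-8 encoded surrogate (U+D800..U+DFFF), U+FFFE, or U+FFFF starts.


def find_forbidden_positions(data: bytes) -> list:
    """Return the start offsets of encoded surrogates, U+FFFE, and U+FFFF."""
    offsets = []
    for i in range(len(data) - 2):
        b0, b1, b2 = data[i], data[i + 1], data[i + 2]
        # ED A0..BF 80..BF covers exactly U+D800..U+DFFF.
        if b0 == 0xED and (b1 & 0xE0) == 0xA0 and (b2 & 0xC0) == 0x80:
            offsets.append(i)
        # EF BF BE is U+FFFE, EF BF BF is U+FFFF.
        elif b0 == 0xEF and b1 == 0xBF and b2 in (0xBE, 0xBF):
            offsets.append(i)
    return offsets
```

Neighbouring legal code positions are left alone: U+D7FF (ED 9F BF) and U+FFFD (EF BF BD) produce no hits.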


At the moment I don't know.  I haven't looked at the UTF-8 {en,de}coding
code to see which of these are deemed malformed.
 
-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
