Re: Encode, take five (malformed UTF-8)

On Wed, Sep 13, 2000 at 01:33:33AM +0100, Markus Kuhn wrote:

Jarkko Hietaniemi wrote on 2000-09-12 23:42 UTC:

        '7'             UTF-7
        '8'             UTF-8
        '16be'          UTF-16 big-endian
        '16le'          UTF-16 little-endian
        '16ne'          UTF-16 native-endian
        '32be'          UTF-32 big-endian
        '32le'          UTF-32 little-endian
        '32ne'          UTF-32 native-endian


I would somehow prefer

          '7'             UTF-7
          '8'             UTF-8
          '16be'          UTF-16 big-endian
          '16le'          UTF-16 little-endian
        ! '16'            UTF-16 native-endian
          '32be'          UTF-32 big-endian
          '32le'          UTF-32 little-endian
        ! '32'            UTF-32 native-endian

No need to introduce new acronyms and terms such as "ne".
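
To make the naming concrete, here is a minimal sketch of such a lookup table, written in Python rather than Perl. The short names are the ones proposed above; `UTF_NAMES` and `encode_as` are hypothetical names for illustration only, and the right-hand sides are Python codec names, not anything Encode defines. It assumes "native-endian" means the byte order of the running machine, with no BOM written.

```python
import sys

# Byte order of the machine running this code.
_NATIVE = "le" if sys.byteorder == "little" else "be"

# Hypothetical table: proposed short names -> Python codec names.
UTF_NAMES = {
    "7": "utf-7",
    "8": "utf-8",
    "16be": "utf-16-be",
    "16le": "utf-16-le",
    "16": "utf-16-" + _NATIVE,   # native-endian, no BOM
    "32be": "utf-32-be",
    "32le": "utf-32-le",
    "32": "utf-32-" + _NATIVE,   # native-endian, no BOM
}


def encode_as(text, short_name):
    """Encode text using the codec that the short name maps to."""
    return text.encode(UTF_NAMES[short_name])
```

With this table, plain '16' and '32' resolve to whichever explicit variant matches the host, so no separate "ne" suffix is needed.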


True.

=head2 Handling Malformed Data


What exactly is malformed UTF-8 data here?

Obviously at least everything listed in section R.7 of ISO 10646-1/Amd.2.

Does it also cover overlong UTF-8 sequences, i.e. any string
containing a byte sequence that matches one of these five bit patterns?

  1100000x,
  11100000 100xxxxx,
  11110000 1000xxxx,
  11111000 10000xxx,
  11111100 100000xx
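
The five patterns above translate directly into byte masks. The following is only a sketch in Python, not what Encode does; the function name `has_overlong_prefix` is made up for illustration. It scans every byte position, on the assumption that "containing" means the pattern may start anywhere in the string.

```python
def has_overlong_prefix(data: bytes) -> bool:
    """Return True if any position in data starts one of the five
    overlong bit patterns listed above."""
    for i, b in enumerate(data):
        if b & 0xFE == 0xC0:                    # 1100000x
            return True
        if i + 1 >= len(data):
            continue                            # no second byte to inspect
        nxt = data[i + 1]
        if b == 0xE0 and nxt & 0xE0 == 0x80:    # 11100000 100xxxxx
            return True
        if b == 0xF0 and nxt & 0xF0 == 0x80:    # 11110000 1000xxxx
            return True
        if b == 0xF8 and nxt & 0xF8 == 0x80:    # 11111000 10000xxx
            return True
        if b == 0xFC and nxt & 0xFC == 0x80:    # 11111100 100000xx
            return True
    return False
```

For example, the two-byte sequence C0 80 (an overlong encoding of U+0000) is flagged, while E0 A0 80 (the shortest encoding of U+0800) is not.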

Does it also cover UTF-8 encoded code positions U+D800 to U+DFFF (UTF-16
surrogates) as well as U+FFFE (anti-BOM) and U+FFFF, all of which must
not occur in proper UTF-8 and UTF-32 data according to the standard
(see note 3 in section R.4 of UCS)?
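
If the answer is yes, the check on decoded code points is simple. Here is a sketch in Python, assuming the data has already been decoded into code points; `is_forbidden_code_point` and `find_forbidden` are hypothetical names, and the set of rejected values is exactly the one named in the question above, nothing more.

```python
def is_forbidden_code_point(cp: int) -> bool:
    """True for UTF-16 surrogates (U+D800..U+DFFF), U+FFFE and U+FFFF."""
    return 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF)


def find_forbidden(text: str) -> list:
    """Return (index, code point) pairs for every forbidden character."""
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if is_forbidden_code_point(ord(ch))
    ]
```

To run the check on raw bytes in Python, the input can be decoded with the "surrogatepass" error handler first, so that encoded surrogates such as ED A0 80 reach the check instead of raising an error in the decoder.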

It would be useful if the spec were clearer here.


Thanks for the info.

References:

  - ISO/IEC 10646-1:1993(E), Amd. 2,
    http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html

  - http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen