perl-unicode

UTF-8 (strict) appears broken

2008-03-12 17:39:58

1. 'Ill-formed' UTF-8
=====================

The Unicode Standard specifies that any UTF-8 sequence that does not
correspond to this table is 'ill-formed':

   Code Points        | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
   -------------------+----------+----------+----------+----------+
     U+0000..U+007F   |  00..7F  |    --    |    --    |    --    |
     U+0080..U+07FF   |  C2..DF  |  80..BF  |    --    |    --    |
     U+0800..U+0FFF   |    E0    |  A0..BF  |  80..BF  |    --    |
     U+1000..U+CFFF   |  E1..EC  |  80..BF  |  80..BF  |    --    |
     U+D000..U+D7FF   |    ED    |  80..9F  |  80..BF  |    --    |
     U+E000..U+FFFF   |  EE..EF  |  80..BF  |  80..BF  |    --    |
    U+10000..U+3FFFF  |    F0    |  90..BF  |  80..BF  |  80..BF  |
    U+40000..U+FFFFF  |  F1..F3  |  80..BF  |  80..BF  |  80..BF  |
   U+100000..U+10FFFF |    F4    |  80..8F  |  80..BF  |  80..BF  |

Note in particular that:

  - anything beyond U+10FFFF is ill-formed.

  - anything in U+D800..U+DFFF (the surrogates) is ill-formed.

  - only one encoding of each Code Point is well-formed (no overlong
    forms).

We'd expect UTF-8 decode to spot ill-formed sequences, though some
special handling of incomplete sequences at the end of a buffer would
be handy.

We'd expect UTF-8 encode to only generate well-formed sequences.
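
As a sanity check, the table can be exercised directly.  A sketch in
Python (not Perl) follows, since its strict UTF-8 codec implements the
same well-formedness rules; the helper name is mine:

```python
def is_well_formed(data: bytes) -> bool:
    """Return True iff data is well-formed UTF-8 per the table above."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

assert is_well_formed("\u00e9".encode("utf-8"))   # C3 A9, well-formed
assert not is_well_formed(b"\xc0\xaf")            # overlong '/' -- 1st byte must be C2..DF
assert not is_well_formed(b"\xed\xa0\x80")        # U+D800, a surrogate
assert not is_well_formed(b"\xf4\x90\x80\x80")    # beyond U+10FFFF
```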

2. Extended Sequences
=====================

Unicode and ISO/IEC 10646:2003 define meanings for UTF-8-compatible
sequences of up to 6 bytes, allowing for characters up to 0x7FFF_FFFF.

The Unicode reference code for reading UTF-8 recognises these extended
sequences as being single entities (though ill-formed).

Perl has its own further 7 and 13 byte forms, allowing for characters up
to 0xF_FFFF_FFFF and 2^72-1, respectively.  These are beyond UTF-8.
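
For comparison, a modern strict decoder need not treat an extended
sequence as a single entity at all.  Python, for instance, rejects a
6-byte form at its first byte (shown here only to illustrate the
contrast with the Unicode reference code):

```python
# FD BF BF BF BF BF would be the old 6-byte form for 0x7FFF_FFFF.
seq = b"\xfd\xbf\xbf\xbf\xbf\xbf"
try:
    seq.decode("utf-8")
except UnicodeDecodeError as e:
    # The reported error covers only the lead byte, not all 6 bytes.
    assert (e.start, e.end) == (0, 1)
```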

3. Non-Characters
=================

The only other cause for concern is non-characters.  These are:

  * U+FFFE and U+FFFF and the last two code points in every other
    Unicode plane.

    Unicode code space is divided into 17 'planes' of 65,536 code
    points each.  So code points U+01_FFFE, U+01_FFFF, U+02_FFFE,
    U+02_FFFF, ... U+10_FFFE and U+10_FFFF are all non-characters.

  * U+FDD0..U+FDEF

Now, Unicode 5.0.0 says:

  "Applications are free to use any of these noncharacter code points
   internally but should never attempt to exchange them. If a
   noncharacter is received in open interchange, an application is not
   required to interpret it in any way. It is good practice, however,
   to recognize it as a noncharacter and to take appropriate action,
   such as removing it from the text."

  "Noncharacter code points are reserved for internal use, such as for
   sentinel values. They should never be interchanged. They do, however,
   have well-formed representations in Unicode encoding forms and
   survive conversions between encoding forms. This allows sentinel
   values to be preserved internally across Unicode encoding forms, even
   though they are not designed to be used in open interchange."

So... this is not so clear-cut.  For "open interchange" UTF-8 should
disallow the non-characters.  However, for local storage of Unicode
stuff, non-characters should be allowed.
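
The 66 non-characters are easy to enumerate and, being well-formed,
they survive a strict encode/decode round trip.  A quick check in
Python (the helper name is mine):

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode noncharacter code points."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

# All 66 non-characters have well-formed UTF-8 representations and
# round-trip through encode/decode -- consistent with "local storage".
nonchars = [cp for cp in range(0x110000) if is_noncharacter(cp)]
assert len(nonchars) == 66
for cp in nonchars:
    s = chr(cp)
    assert s.encode("utf-8").decode("utf-8") == s
```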

4. What 'UTF-8' Does
====================

Ill-formed sequences -- fine (mostly):

  * UTF-8 decode treats these as errors, and will stop or use fallback
    decoding as required.

    The default fallback is:

      - erroneous sequences for values <= 0x7FFF_FFFF -- replaced by
        U+FFFD

        *** information is being lost, here :-(

      - anything else: each byte which is not recognised as being part
        of a complete 2..6 byte sequence is replaced by U+FFFD

        *** so one cannot distinguish ill-formed sequences from
            out of range characters.

    The PERLQQ, HTMLCREF and XMLCREF fallbacks are:

      - erroneous sequences for values <= 0x7FFF_FFFF -- replaced by the
        respective escape sequence for the character value.

        This ought to work if the data is HTML or XML, where new escape
        sequences fit right in if HTMLCREF or XMLCREF is used.

        *** PERLQQ, however, may fail if '\' appears in the input and
            the sender has not escaped it !

            Perhaps PERLQQ should escape '\' that appear in the input ?

        *** In all cases, however, all that's been achieved is that
            non-UTF-8 characters have been transliterated.  It's still
            a puzzle what may be done with these characters !

      - anything else: each byte which is not recognised as being part
        of a complete 2..6 byte sequence is replaced by the respective
        escape sequence for the byte value.

        *** this is impossible to distinguish from escaped values which
            could exist in the input !

  * UTF-8 encode will not generate ill-formed sequences and treats
    out-of-range character values as errors.  Errors will stop encoding
    or cause the fallback encoding to be used.

    The default fallback is:

      - characters in error with values <= 0x7FFF_FFFF -- replaced by
        U+FFFD

        *** Not much one can do here.  It's not clear that U+FFFD is a
            good thing to output -- one could argue for discarding
            this rubbish, instead ?

      - 0x8000_0000 and greater -- replaced by seven or thirteen U+FFFD,
        depending on the length of the Perl internal form !!!

        *** This is also more than a bit odd !!

    The PERLQQ, HTMLCREF and XMLCREF fallbacks are:

      - characters in error with values <= 0x7FFF_FFFF -- replaced by
        the respective escape sequence for the character value.

        This ought to work if the data is HTML or XML, where new escape
        sequences fit right in if HTMLCREF or XMLCREF is used.

        *** PERLQQ, however, may fail if '\' appears in the output and
            the sender has not escaped it !

            Perhaps PERLQQ should escape '\' that appear in the output ?

        *** In all cases, however, all that's been achieved is that
            non-UTF-8 characters have been transliterated.  It's still
            a puzzle what may be done with these characters !

      - 0x8000_0000 and greater -- replaced by the seven or thirteen
        bytes that comprise the Perl internal form, each as its
        respective escape sequence !!!

        *** This is also more than a bit odd !!
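
The information loss complained of above is easy to demonstrate: after
a replacement fallback, a genuine U+FFFD in the data and a replaced
ill-formed byte come out identical (Python shown here for brevity):

```python
# A real U+FFFD and an ill-formed byte decode to the same thing once
# the 'replace' fallback has run -- the distinction is gone.
good = "\ufffd".encode("utf-8")   # a genuine U+FFFD: EF BF BD
bad = b"\xff"                     # an ill-formed byte
assert good.decode("utf-8", "replace") == bad.decode("utf-8", "replace")
```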

Incomplete sequences -- fine, but not documented !

  * UTF-8 decode generally treats these as ill-formed, as above.

    However, the STOP_AT_PARTIAL CHECK option will cause decode to stop,
    without error (so without invoking the fallback).
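
The buffer-boundary behaviour that STOP_AT_PARTIAL provides is what
incremental decoders do in general.  For example, in Python:

```python
import codecs

# An incremental decoder keeps an incomplete trailing sequence buffered
# until more bytes arrive, rather than treating it as an error.
dec = codecs.getincrementaldecoder("utf-8")()
assert dec.decode(b"caf\xc3") == "caf"   # C3 starts a 2-byte sequence
assert dec.decode(b"\xa9") == "\xe9"     # the A9 completes the 'e'-acute
```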

Non-Character Values -- inconsistent and arguable !!

  As noted above, one can argue for two approaches here, depending on
  whether the data being en/decoded is internal or external.

  For internal data, non-characters are valid and should be preserved.

  For external data, non-characters should not be sent or received.  One
  can debate whether they should be dropped or replaced or escaped.

  UTF-8 encode/decode recognise only U+FFFF as a non-character, and
  treat it as an error.

  *** This looks like a bug.  If non-character values are to be treated
      as errors, I suggest all non-character values should be so
      treated.

  *** This caters only for external data exchange.

  The error handling is as for ill-formed sequences, see above.

5. Conclusion: 'UTF-8' is broken
================================

 * the non-character handling is incomplete.

 * it can be argued that there should be an option to accept/allow non-
   character values.

 * the various fallback options are all less than satisfactory in their
   own way.

   One can see why the ref:Sub CHECK argument was invented.

   HOWEVER: it would be handy if there was a second parameter passed to
   the CHECK subroutine, telling it *why* the given sequence cannot
   be encoded/decoded, in particular:

     -- out of range character value

     -- ill-formed sequence (and could pass in everything up to the next
        valid byte ?)

     -- non-character

     -- incomplete sequence

   for otherwise the subroutine has to do all the work to figure this
   out for itself !
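
To illustrate, a classifier along the suggested lines could look like
the sketch below.  It is only a sketch, in Python rather than Perl, and
the function name and category strings are mine, not any real Encode
interface:

```python
def classify(data: bytes, pos: int) -> str:
    """Say *why* the sequence starting at data[pos] cannot be decoded."""
    b0 = data[pos]
    if 0xF5 <= b0 <= 0xFD:
        return "out-of-range"              # value beyond U+10FFFF
    need = (1 if 0xC2 <= b0 <= 0xDF else
            2 if 0xE0 <= b0 <= 0xEF else
            3 if 0xF0 <= b0 <= 0xF4 else 0)
    if need == 0:
        return "ill-formed"                # stray continuation, C0/C1, FE/FF
    chunk = data[pos : pos + 1 + need]
    if len(chunk) < 1 + need:
        return "incomplete"                # truncated at end of buffer
    try:
        ch = chunk.decode("utf-8")
    except UnicodeDecodeError:
        return "ill-formed"                # bad tail, overlong, surrogate
    cp = ord(ch)
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "non-character"
    return "well-formed"

assert classify(b"\xfd\xbf", 0) == "out-of-range"
assert classify(b"\x80", 0) == "ill-formed"
assert classify(b"\xe2\x82", 0) == "incomplete"
assert classify(b"\xef\xb7\x90", 0) == "non-character"   # U+FDD0
```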

------------------------------------------------------------------------

It is clear that what data is valid, and how to deal with invalid data,
is really up to the application.  Trying to be helpful in Encode/Decode
is apparently tricky.

It is also clear that a lot of heavy duty character/byte bashing would
be better if it could be provided in XS land.

However, thinking about some simple but general mechanism for this is
making my head hurt.

[I'm going to go away now, and lie down.]
-- 
Chris Hall               highwayman.com
