Encode::is_utf8, v2.18, checking for well formed UTF-8, bug ?


I understand that is_utf8(<string>, 1) will check whether the given
string contains well-formed UTF-8, once the string has been forced to
utf8.

Experiment shows that this does indeed reject strings that contain:

  - any invalid bytes, ie:

     * bytes 0x80..0xBF outside a sequence
     * bytes outside 0x80..0xBF inside a sequence

  - any redundant UTF-8 sequences, ie any sequence which is well-formed,
    but for which a shorter sequence exists.

So far, so good.
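
The experiment can be sketched roughly as follows. This is a minimal
sketch, not the exact script used: the helper name looks_well_formed is
made up for illustration, and it assumes that "forcing the string to
utf8" means turning the flag on with Encode::_utf8_on.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: mark the bytes as utf8 without decoding them,
# then ask Encode whether the result is well formed.
sub looks_well_formed {
    my ($bytes) = @_;
    my $copy = $bytes;          # work on a copy; _utf8_on changes its argument
    Encode::_utf8_on($copy);    # sets the UTF8 flag only, no validation
    return Encode::is_utf8($copy, 1) ? 1 : 0;
}

for my $case (
    [ 'valid two-byte sequence' => "\xC3\xA9" ],
    [ 'stray continuation byte' => "\x80"     ],
    [ 'truncated sequence'      => "\xC3\x41" ],
    [ 'redundant (overlong)'    => "\xC0\x80" ],
) {
    my ($label, $bytes) = @$case;
    printf "%-26s %d\n", $label, looks_well_formed($bytes);
}
```

Without the _utf8_on step, is_utf8 only reports that the flag is off,
so a plain byte string is never examined.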

It also rejects all sequences in the range:

  U+0014_0000: 0: \xF5\x80\x80\x80
  U+001F_FFFF: 0: \xF7\xBF\xBF\xBF

But otherwise accepts all sequences between U+0080: \xC2\x80 and
U+7FFF_FFFF: \xFD\xBF\xBF\xBF\xBF\xBF.
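
For reference, the byte sequences quoted above can be reproduced with a
small hand-rolled encoder. This is only a sketch of the original (up to
six byte) UTF-8 bit layout; encode_extended_utf8 is a hypothetical name,
it does no validity checking, and it is not how Encode or perl do it
internally.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one code point using the original
# (up to six byte) UTF-8 bit layout, with no range checks.
sub encode_extended_utf8 {
    my ($cp) = @_;
    return chr($cp) if $cp < 0x80;

    # [ upper limit (exclusive), number of bytes, lead-byte marker ]
    for my $form ( [ 0x800,       2, 0xC0 ], [ 0x1_0000,   3, 0xE0 ],
                   [ 0x20_0000,   4, 0xF0 ], [ 0x400_0000, 5, 0xF8 ],
                   [ 0x8000_0000, 6, 0xFC ] ) {
        my ($limit, $len, $marker) = @$form;
        next if $cp >= $limit;
        my @tail;
        for (2 .. $len) {                  # continuation bytes, low bits first
            unshift @tail, 0x80 | ($cp & 0x3F);
            $cp >>= 6;
        }
        return pack 'C*', $marker | $cp, @tail;
    }
    die "code point out of range\n";
}

# Show the bytes for the first code point in the rejected range.
print join('', map { sprintf '\\x%02X', ord($_) }
               split //, encode_extended_utf8(0x14_0000)), "\n";
```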

I am content that the definition of utf8 allows for character values of
at least 0x00..0x7FFF_FFFF.  But there is a hole in the range:
U+0014_0000..U+001F_FFFF is rejected.  Is this a bug?

It would be useful to have a check that spots:

  - U+D800..U+DFFF  -- nonsense values

  - U+FFFD  -- though could be meaningful
  - U+FFFE  -- though may be being used for BOM
  - U+FFFF  -- not really expected

  - characters beyond U+10_FFFF  -- nonsense values

The check should run across either a byte string or a string that is
already utf8.

A smart check could return a bit mask, so that one could detect the
presence of each of the above cases (and others that I don't know of).

Actually, it could also spot BOM marker(s).
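
To make the idea concrete, here is a rough pure-Perl sketch of such a
bit-mask check. The flag names and the function classify_chars are
invented for illustration, nothing like this exists in Encode as far as
I know, and the sketch only handles a string that already holds
characters (it does not scan raw bytes).

```perl
use strict;
use warnings;

# Hypothetical flag names; the values are arbitrary distinct bits.
use constant {
    HAS_SURROGATE   => 0x01,   # U+D800..U+DFFF
    HAS_REPLACEMENT => 0x02,   # U+FFFD
    HAS_FFFE        => 0x04,   # U+FFFE
    HAS_FFFF        => 0x08,   # U+FFFF
    HAS_BEYOND      => 0x10,   # above U+10FFFF
    HAS_BOM         => 0x20,   # U+FEFF anywhere in the string
};

# Scan a character string once and OR together one bit per class seen.
sub classify_chars {
    my ($str) = @_;
    my $mask = 0;
    for my $ch (split //, $str) {
        my $cp = ord $ch;
        $mask |= HAS_SURROGATE   if $cp >= 0xD800 && $cp <= 0xDFFF;
        $mask |= HAS_REPLACEMENT if $cp == 0xFFFD;
        $mask |= HAS_FFFE        if $cp == 0xFFFE;
        $mask |= HAS_FFFF        if $cp == 0xFFFF;
        $mask |= HAS_BEYOND      if $cp >  0x10FFFF;
        $mask |= HAS_BOM         if $cp == 0xFEFF;
    }
    return $mask;
}

printf "mask = 0x%02X\n", classify_chars("plain ASCII");
```

The caller can then test individual bits, for example
classify_chars($s) & HAS_SURROGATE, instead of getting a single yes/no
answer.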

I know that this can be done by decode/encode with UTF-8:

  - decode('UTF-8', string)

    inserts U+FFFD for: U+D800..U+DFFF, U+FFFF and anything beyond
    U+10_FFFF.

    It leaves U+FFFD and U+FFFE.

    To detect invalids one has to look in the decoded string for
    \x{FFFD} or \x{FFFE}.

  - encode('UTF-8', string, 1)

    will croak for: U+D800..U+DFFF, U+FFFF and anything beyond
    U+10_FFFF.

    It leaves U+FFFD and U+FFFE.  To detect those one has to scan the
    encoded string.
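
A yes/no validity test built on decode might look like the sketch
below. The helper name is_strict_utf8 is made up; the sketch assumes the
strict 'UTF-8' encoding and the documented Encode::FB_CROAK check value,
and it still pays for a full decode plus a copy of the input.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $bytes decodes as strict UTF-8.
sub is_strict_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode may modify its input when CHECK is set
    my $ok = eval {
        Encode::decode('UTF-8', $copy, Encode::FB_CROAK());
        1;
    };
    return $ok ? 1 : 0;
}

for my $bytes ("\xC3\xA9", "\xED\xA0\x80", "\xF4\x90\x80\x80") {
    printf "%-16s %d\n",
        join('', map { sprintf '\\x%02X', ord($_) } split //, $bytes),
        is_strict_utf8($bytes);
}
```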

But we seem to be doing a lot of work here...  and apparently copying
strings around to no good effect.  (Though, I guess that at some point
one will have to decode the string, if it is valid 'UTF-8'.)

Chris

PS: I find that decode('UTF-8', string, sub { $n++ ; return '?' ; })
    simply doesn't work !

    That is, the embedded sub does not appear to be called, but decode
    seems to stop at the first error, and quietly give up, returning
    the partly decoded string.
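
    A minimal reproduction of the PS could look like the sketch below.
    The input string is just an example, and what it prints will depend
    on the installed Encode version, so no expected output is given.

```perl
use strict;
use warnings;
use Encode ();

my $n     = 0;
my $bytes = "a\x80b";    # one invalid byte between two ASCII characters

# Pass a code reference as CHECK; it should be called once per bad byte.
my $out = Encode::decode('UTF-8', $bytes, sub { $n++; return '?' });

printf "callback calls: %d, result: %s\n", $n, $out;
```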
-- 
Chris Hall