Encode::is_utf8 (Encode v2.18), checking for well-formed UTF-8: bug?
As I understand it, is_utf8(<string>, 1) checks whether the given
string contains well-formed UTF-8, once the string has been forced to
utf8.
Experiment shows that this does indeed reject strings that contain:
  - any invalid bytes, i.e.:
      * bytes 0x80..0xBF outside a sequence
      * bytes outside 0x80..0xBF inside a sequence (where a
        continuation byte is expected)
  - any redundant UTF-8 sequences, i.e. any sequence which is
    well-formed, but for which a shorter sequence exists.
So far, so good.
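
For reference, a minimal sketch of the kind of experiment described above. The helper name looks_like_utf8 is made up for illustration; it copies the bytes, turns the UTF8 flag on with Encode::_utf8_on and then asks Encode::is_utf8 to check the result. Results for the more exotic sequences may well differ between Encode and perl versions, so treat this as a probe rather than a definition of well-formedness.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode): returns 1 if the bytes pass
# is_utf8's well-formedness check once the UTF8 flag is forced on,
# otherwise 0. It works on a copy, so the caller's string is untouched.
sub looks_like_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;
    Encode::_utf8_on($copy);    # sets the flag only, no validation
    return Encode::is_utf8($copy, 1) ? 1 : 0;
}

# A few probes matching the cases listed above.
my @probes = (
    [ 'two-byte sequence'        => "\xC3\xA9" ],
    [ 'stray continuation byte'  => "\x80"     ],
    [ 'truncated sequence'       => "\xC3\x41" ],
    [ 'redundant encoding of NUL' => "\xC0\x80" ],
);
printf "%-28s %d\n", $_->[0], looks_like_utf8($_->[1]) for @probes;
```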
It also rejects all sequences in the range:
  U+0014_0000: 0: \xF5\x80\x80\x80
  U+001F_FFFF: 0: \xF7\xBF\xBF\xBF
But otherwise it accepts all sequences between U+0080: \xC2\x80 and
U+7FFF_FFFF: \xFD\xBF\xBF\xBF\xBF\xBF.
I am content that the definition of utf8 allows for character values
of at least 0x00..0x7FFF_FFFF. But that leaves a hole in the accepted
range, U+0014_0000..U+001F_FFFF. Is this a bug?
It would be useful to have a check that spots:
  - U+D800..U+DFFF -- nonsense values
  - U+FFFD -- though could be meaningful
  - U+FFFE -- though may be in use as a BOM
  - U+FFFF -- not really expected
  - characters beyond U+10_FFFF -- nonsense values
The check should run across either a byte string or a string that is
already utf8.
A smart check could return a bit mask, so that one could detect the
presence of each of the above cases (and others that I don't know of).
It could also spot BOM marker(s).
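
To make the bit-mask idea concrete, here is a rough sketch. Nothing in it comes from Encode: the function scan_code_points and the HAS_* flag names are invented for illustration. It assumes the input is already a decoded character string, and that a BOM means U+FEFF anywhere in the string.

```perl
use strict;
use warnings;

# Hypothetical flag values, one bit per suspicious category.
use constant {
    HAS_SURROGATE   => 0x01,    # U+D800..U+DFFF
    HAS_REPLACEMENT => 0x02,    # U+FFFD
    HAS_FFFE        => 0x04,    # U+FFFE
    HAS_FFFF        => 0x08,    # U+FFFF
    HAS_ABOVE_MAX   => 0x10,    # beyond U+10_FFFF
    HAS_BOM         => 0x20,    # U+FEFF (assumed meaning of "BOM")
};

# Walks a character string and returns the OR of the flags for every
# category seen. Returns 0 if nothing suspicious is found.
sub scan_code_points {
    my ($str) = @_;
    my $mask = 0;
    for my $i (0 .. length($str) - 1) {
        my $cp = ord substr($str, $i, 1);
        if    ($cp >= 0xD800 && $cp <= 0xDFFF) { $mask |= HAS_SURROGATE; }
        elsif ($cp == 0xFFFD)                  { $mask |= HAS_REPLACEMENT; }
        elsif ($cp == 0xFFFE)                  { $mask |= HAS_FFFE; }
        elsif ($cp == 0xFFFF)                  { $mask |= HAS_FFFF; }
        elsif ($cp == 0xFEFF)                  { $mask |= HAS_BOM; }
        elsif ($cp >  0x10FFFF)                { $mask |= HAS_ABOVE_MAX; }
    }
    return $mask;
}

my $mask = scan_code_points("a\x{FFFD}b");
print "replacement character present\n" if $mask & HAS_REPLACEMENT;
```

This walks the string one character at a time in Perl, so it is slow; a real implementation would presumably do the scan in XS and handle byte strings directly.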
I know that this can be done by decode/encode with UTF-8:
  - decode('UTF-8', string)
    inserts U+FFFD for: U+D800..U+DFFF, U+FFFF and anything beyond
    U+10_FFFF.
    It leaves U+FFFD and U+FFFE.
    To detect invalids one has to look in the decoded string for
    \x{FFFD} or \x{FFFE}.
  - encode('UTF-8', string, 1)
    will croak for: U+D800..U+DFFF, U+FFFF and anything beyond
    U+10_FFFF.
    It leaves U+FFFD and U+FFFE. To detect those one has to scan the
    encoded string.
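
A variation on the decode route, as a sketch only: passing FB_CROAK as the CHECK argument makes decode die on the first problem instead of substituting U+FFFD, so a failure can be trapped with eval. The helpers decode_strict and has_suspect are made-up names, and exactly which code points the strict 'UTF-8' encoding rejects may vary between Encode versions.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns the decoded string, or undef if the
# bytes are not acceptable to the strict 'UTF-8' encoding.
sub decode_strict {
    my ($bytes) = @_;
    # LEAVE_SRC stops decode from modifying its input in place.
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $chars;
}

# Hypothetical helper: U+FFFD and U+FFFE survive decoding, so they
# still have to be looked for in the result.
sub has_suspect {
    my ($chars) = @_;
    return ( index($chars, "\x{FFFD}") >= 0
          || index($chars, "\x{FFFE}") >= 0 ) ? 1 : 0;
}

my $chars = decode_strict("caf\xC3\xA9");
print defined $chars ? "decoded ok\n" : "rejected\n";
```

This still copies the string, so it does not address the cost concern; it only avoids confusing a substituted U+FFFD with one that was in the input.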
But we seem to be doing a lot of work here... and apparently copying
strings around to no good effect. (Though, I guess that at some point
one will have to decode the string, if it is valid 'UTF-8'.)
Chris
PS: I find that decode('UTF-8', string, sub { $n++ ; return '?' ; })
    simply doesn't work!
    That is, the embedded sub does not appear to be called. Instead
    decode seems to stop at the first error and quietly give up,
    returning the partly decoded string.
--
Chris Hall