On Wed, Sep 13, 2000 at 01:33:33AM +0100, Markus Kuhn wrote:
Jarkko Hietaniemi wrote on 2000-09-12 23:42 UTC:
'7' UTF-7
'8' UTF-8
'16be' UTF-16 big-endian
'16le' UTF-16 little-endian
'16ne' UTF-16 native-endian
'32be' UTF-32 big-endian
'32le' UTF-32 little-endian
'32ne' UTF-32 native-endian
I would somehow prefer
'7' UTF-7
'8' UTF-8
'16be' UTF-16 big-endian
'16le' UTF-16 little-endian
! '16' UTF-16 native-endian
'32be' UTF-32 big-endian
'32le' UTF-32 little-endian
! '32' UTF-32 native-endian
No need to introduce new acronyms and terms such as "ne".
True.
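For illustration, the shorter label set could be mapped onto concrete encodings roughly as follows. This is only a sketch in Python, not the proposed Perl interface: the table name `LABELS` and the helper `encode_with_label` are hypothetical, and it assumes that "native-endian" means the byte order of the machine running the code.

```python
import sys

# Assumption: "native-endian" is the byte order of the running machine.
_NATIVE = "le" if sys.byteorder == "little" else "be"

# Hypothetical table: short labels from the proposal -> Python codec names.
LABELS = {
    "7": "utf-7",
    "8": "utf-8",
    "16be": "utf-16-be",
    "16le": "utf-16-le",
    "16": "utf-16-" + _NATIVE,   # native-endian, no separate "ne" label
    "32be": "utf-32-be",
    "32le": "utf-32-le",
    "32": "utf-32-" + _NATIVE,   # native-endian, no separate "ne" label
}


def encode_with_label(text, label):
    """Encode text using the codec that the short label stands for."""
    return text.encode(LABELS[label])
```

The explicit "-be"/"-le" codecs are used for the native case so that no byte order mark is emitted.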
=head2 Handling Malformed Data
What exactly is malformed UTF-8 data here?
Obviously at least everything listed in section R.7 of ISO 10646-1/Amd.2.
Does it also cover overlong UTF-8 sequences, i.e. any string
containing any of the following five bit sequences?
1100000x,
11100000 100xxxxx,
11110000 1000xxxx,
11111000 10000xxx,
11111100 100000xx
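A scan for these five patterns could look roughly like the Python sketch below. The function name `find_overlong` is made up for illustration. It only looks for the bit patterns listed above and does not otherwise validate the input.

```python
# (lead byte, mask for the following byte); the masked value must be 0x80.
_TWO_BYTE_PATTERNS = (
    (0xE0, 0xE0),  # 11100000 100xxxxx
    (0xF0, 0xF0),  # 11110000 1000xxxx
    (0xF8, 0xF8),  # 11111000 10000xxx
    (0xFC, 0xFC),  # 11111100 100000xx
)


def find_overlong(data):
    """Return the byte offset of the first overlong pattern, or -1."""
    for i, b0 in enumerate(data):
        # 1100000x: 0xC0 or 0xC1 can only start an overlong 2-byte form.
        if b0 & 0xFE == 0xC0:
            return i
        if i + 1 < len(data):
            b1 = data[i + 1]
            for lead, mask in _TWO_BYTE_PATTERNS:
                if b0 == lead and b1 & mask == 0x80:
                    return i
    return -1
```

For example, `find_overlong(b"\xc0\x80")` reports offset 0 (the overlong form of U+0000), while the regular encoding of U+0800, `b"\xe0\xa0\x80"`, is not flagged.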
Does it also cover UTF-8 encoded code positions U+D800 to U+DFFF (UTF-16
surrogates) as well as U+FFFE (anti-BOM) and U+FFFF, all of which must
not occur in proper UTF-8 and UTF-32 data according to the standard
(see note 3 in section R.4 of UCS)?
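To make the second question concrete, here is a small Python sketch that flags exactly the code positions named above. The names `is_forbidden` and `forbidden_positions` are hypothetical. The sketch assumes the input is otherwise decodable: it uses Python's `surrogatepass` error handler so that encoded surrogates reach the check instead of being rejected by the decoder, and other malformed input still raises `UnicodeDecodeError`.

```python
def is_forbidden(cp):
    """True for U+D800..U+DFFF (surrogates), U+FFFE and U+FFFF."""
    return 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF)


def forbidden_positions(data):
    """Return character indexes (not byte offsets) of forbidden code points."""
    # "surrogatepass" lets UTF-8 encoded surrogates through the decoder.
    text = data.decode("utf-8", errors="surrogatepass")
    return [i for i, ch in enumerate(text) if is_forbidden(ord(ch))]
```

For example, `forbidden_positions(b"a\xed\xa0\x80")` returns `[1]`, because the bytes ED A0 80 are the UTF-8 style encoding of the surrogate U+D800.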
It would be useful if the spec were clearer here.
Thanks for the info.
References:
- ISO/IEC 10646-1:1993(E), Amd. 2,
http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html
- http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen