Jarkko Hietaniemi wrote on 2002-03-10 23:27 UTC:
Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+CFFF       E1..EC    80..BF    80..BF
U+D000..U+D7FF       ED        80..9F    80..BF
U+D800..U+DFFF       ******* ill-formed *******
U+E000..U+FFFF       EE..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF
It seems that I have something funny in utf8.h after the 7F here,
so you may see something funny, too...
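The reason why C0..C1 are missing here after 7F is that the above table
also excludes overlong UTF-8 sequences: a first byte of C0 or C1 could
only start a two-byte encoding of a code point below U+0080, which has
to be encoded as a single byte.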
See
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
for a list of all bit patterns that have to be excluded to avoid
overlong UTF-8 sequences.
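
To illustrate how the table rules out overlong forms, here is a minimal
sketch of a validator driven directly by its rows. It is written in
Python for brevity and is not the code in utf8.h; the names WELL_FORMED
and is_well_formed_utf8 are made up for this example.

```python
# Each row: (lead lo, lead hi, 2nd-byte lo, 2nd-byte hi, further 80..BF bytes).
# The rows mirror the table above; the surrogate row is simply absent.
WELL_FORMED = [
    (0x00, 0x7F, None, None, 0),   # U+0000..U+007F
    (0xC2, 0xDF, 0x80, 0xBF, 0),   # U+0080..U+07FF
    (0xE0, 0xE0, 0xA0, 0xBF, 1),   # U+0800..U+0FFF
    (0xE1, 0xEC, 0x80, 0xBF, 1),   # U+1000..U+CFFF
    (0xED, 0xED, 0x80, 0x9F, 1),   # U+D000..U+D7FF
    (0xEE, 0xEF, 0x80, 0xBF, 1),   # U+E000..U+FFFF
    (0xF0, 0xF0, 0x90, 0xBF, 2),   # U+10000..U+3FFFF
    (0xF1, 0xF3, 0x80, 0xBF, 2),   # U+40000..U+FFFFF
    (0xF4, 0xF4, 0x80, 0x8F, 2),   # U+100000..U+10FFFF
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data consists only of sequences listed in the table."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        for lo, hi, s_lo, s_hi, extra in WELL_FORMED:
            if lo <= lead <= hi:
                break
        else:
            # 80..BF, C0..C1 and F5..FF never start a well-formed sequence.
            return False
        i += 1
        if s_lo is None:
            continue  # single-byte sequence
        # The second byte has a restricted range that depends on the lead byte.
        if i >= n or not (s_lo <= data[i] <= s_hi):
            return False
        i += 1
        # Any remaining bytes are ordinary continuation bytes.
        for _ in range(extra):
            if i >= n or not (0x80 <= data[i] <= 0xBF):
                return False
            i += 1
    return True
```

With this sketch, the overlong forms C0 80 and E0 80 80, the surrogate
encoding ED A0 80, and F4 90 80 80 (beyond U+10FFFF) are all rejected,
because no row of the table admits them.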
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>