perl-unicode

Re: UTF-8

2002-03-10 18:00:33
Jarkko Hietaniemi wrote on 2002-03-10 23:27 UTC:
 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte
   U+0000..U+007F       00..7F   
   U+0080..U+07FF       C2..DF    80..BF   
   U+0800..U+0FFF       E0        A0..BF    80..BF  
   U+1000..U+CFFF       E1..EC    80..BF    80..BF  
   U+D000..U+D7FF       ED        80..9F    80..BF  
   U+D800..U+DFFF       ******* ill-formed *******
   U+E000..U+FFFF       EE..EF    80..BF    80..BF  
  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

It seems that I have something funny in utf8.h after the 7F here,
so you may see something funny, too...

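The reason why C0 and C1 are missing here after 7F is that the above
table also excludes overlong UTF-8 sequences: a two-byte sequence
starting with C0 or C1 could only encode U+0000..U+007F, and those
code points already have a one-byte form.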
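
As a rough illustration of why C0 and C1 can only start overlong
sequences, here is a small Python sketch. The function name
decode_two_byte_naive is made up for this example; it deliberately
applies the two-byte bit pattern 110xxxxx 10yyyyyy without any
overlong check, which is not how a conforming decoder behaves.

```python
def decode_two_byte_naive(lead, trail):
    """Decode 110xxxxx 10yyyyyy with no overlong check (illustration only)."""
    if lead & 0xE0 != 0xC0 or trail & 0xC0 != 0x80:
        raise ValueError("not a two-byte bit pattern")
    return ((lead & 0x1F) << 6) | (trail & 0x3F)

# Every code point reachable from a C0 or C1 lead byte.
overlong = [decode_two_byte_naive(lead, trail)
            for lead in (0xC0, 0xC1)
            for trail in range(0x80, 0xC0)]

print(hex(min(overlong)), hex(max(overlong)))   # 0x0 0x7f
print(hex(decode_two_byte_naive(0xC2, 0x80)))   # 0x80
```

C0 and C1 together cover exactly U+0000..U+007F, so C2 is the first
lead byte that produces a code point needing two bytes.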

See

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

for a list of all bit patterns that have to be excluded to avoid
overlong UTF-8 sequences.
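
The table quoted above can also be turned directly into a checker.
The following is only a sketch in Python, not what utf8.h does; the
names WELL_FORMED and is_well_formed_utf8 are invented for this
example, and the byte ranges are copied from the table row by row.

```python
# Rows of the table: (first-byte range, [ranges of the following bytes]).
# WELL_FORMED and is_well_formed_utf8 are illustrative names, not from utf8.h.
WELL_FORMED = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data consists only of sequences listed in the table."""
    i = 0
    while i < len(data):
        for (lo, hi), tails in WELL_FORMED:
            if lo <= data[i] <= hi:
                break
        else:
            return False  # 80..C1 and F5..FF never start a sequence
        if i + len(tails) >= len(data):
            return False  # sequence is cut off before its last byte
        for offset, (tlo, thi) in enumerate(tails, start=1):
            if not tlo <= data[i + offset] <= thi:
                return False
        i += 1 + len(tails)
    return True


print(is_well_formed_utf8(b"\xc3\xa9"))      # True  (U+00E9)
print(is_well_formed_utf8(b"\xc0\x80"))      # False (overlong U+0000)
print(is_well_formed_utf8(b"\xe0\x80\x80"))  # False (overlong, 3 bytes)
print(is_well_formed_utf8(b"\xed\xa0\x80"))  # False (U+D800, surrogate)
```

Restricting the second byte after E0, ED, F0 and F4 is what excludes
the longer overlong forms, the surrogates, and anything above U+10FFFF.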

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>