Re: UTF-8

Jarkko Hietaniemi wrote on 2002-03-10 23:27 UTC:

 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte
   U+0000..U+007F       00..7F   
   U+0080..U+07FF       C2..DF    80..BF   
   U+0800..U+0FFF       E0        A0..BF    80..BF  
   U+1000..U+CFFF       E1..EC    80..BF    80..BF  
   U+D000..U+D7FF       ED        80..9F    80..BF  
   U+D800..U+DFFF       ******* ill-formed *******
   U+E000..U+FFFF       EE..EF    80..BF    80..BF  
  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

It seems that I have something funny in utf8.h after the 7F here,
so you may see something funny, too...


C0..C1 is missing after 7F because the above table also excludes
overlong UTF-8 sequences: a two-byte sequence starting with C0 or C1
could only encode U+0000..U+007F, which must be encoded as a single
byte.
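
As a rough illustration of that point (not code from utf8.h; the helper
name naive_decode_2byte is made up for this sketch), here is the plain
two-byte bit arithmetic with no overlong check applied:

```python
def naive_decode_2byte(b1: int, b2: int) -> int:
    """Decode 110xxxxx 10yyyyyy with no overlong check (illustration only)."""
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

# Highest code point reachable with a C0 or C1 lead byte:
highest_c1 = naive_decode_2byte(0xC1, 0xBF)   # 0x7F, still in the one-byte range

# Lowest code point reachable with a C2 lead byte:
lowest_c2 = naive_decode_2byte(0xC2, 0x80)    # 0x80, first to need two bytes

print(f"C1 BF -> U+{highest_c1:04X}")
print(f"C2 80 -> U+{lowest_c2:04X}")
```

Everything a C0 or C1 lead byte can express already fits in 00..7F, so
the table starts the two-byte row at C2.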

See

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

for a list of all bit patterns that have to be excluded to avoid
overlong UTF-8 sequences.
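
For what it is worth, the table quoted above can be turned into a
validator almost line for line. The following is only a sketch under
that assumption: TABLE and is_well_formed_utf8 are hypothetical names,
and this is not how utf8.h does it.

```python
# One row per line of the quoted table:
# (lead byte low, lead byte high, allowed range of 2nd byte or None,
#  number of further continuation bytes that must be in 80..BF)
TABLE = [
    (0x00, 0x7F, None,         0),  # U+0000..U+007F
    (0xC2, 0xDF, (0x80, 0xBF), 0),  # U+0080..U+07FF
    (0xE0, 0xE0, (0xA0, 0xBF), 1),  # U+0800..U+0FFF
    (0xE1, 0xEC, (0x80, 0xBF), 1),  # U+1000..U+CFFF
    (0xED, 0xED, (0x80, 0x9F), 1),  # U+D000..U+D7FF (stops before surrogates)
    (0xEE, 0xEF, (0x80, 0xBF), 1),  # U+E000..U+FFFF
    (0xF0, 0xF0, (0x90, 0xBF), 2),  # U+10000..U+3FFFF
    (0xF1, 0xF3, (0x80, 0xBF), 2),  # U+40000..U+FFFFF
    (0xF4, 0xF4, (0x80, 0x8F), 2),  # U+100000..U+10FFFF
]

def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if every sequence in data matches a row of TABLE."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        for lo, hi, second, rest in TABLE:
            if lo <= lead <= hi:
                break
        else:
            return False  # 80..C1 and F5..FF never start a sequence
        i += 1
        if second is None:
            continue  # single-byte sequence, nothing more to check
        # The 2nd byte carries the overlong/surrogate/range restriction.
        if i >= n or not (second[0] <= data[i] <= second[1]):
            return False
        i += 1
        # Any remaining bytes are ordinary continuation bytes.
        for _ in range(rest):
            if i >= n or not (0x80 <= data[i] <= 0xBF):
                return False
            i += 1
    return True
```

With this, is_well_formed_utf8(b"\xC0\x80") and
is_well_formed_utf8(b"\xED\xA0\x80") both come out False (an overlong
NUL and an encoded surrogate), while ordinary UTF-8 text passes.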

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>
