perl-unicode

Re: Automagical :text layer (was: My favorite bug to fix for 5.8.0)

2002-03-10 16:27:44
> BTW, what is a good regexp to match UTF8 bytes?  Every time I look at RFC 
> 2279 (or p47 of the Unicode Standard 3.0 book), I feel stupider and 
> stupider that it's not clearer to me (or alternately, angrier and angrier 
> that the spec-writers didn't make this clearer).  In perlpodspec, I wrote:

Here is a cut-and-paste from the latest utf8.h:

 The following table is from Unicode 3.2.

 Code Points            1st Byte  2nd Byte  3rd Byte  4th Byte

   U+0000..U+007F       00..7F   
   U+0080..U+07FF       C2..DF    80..BF   
   U+0800..U+0FFF       E0        A0..BF    80..BF  
   U+1000..U+CFFF       E1..EC    80..BF    80..BF  
   U+D000..U+D7FF       ED        80..9F    80..BF  
   U+D800..U+DFFF       ******* ill-formed *******
   U+E000..U+FFFF       EE..EF    80..BF    80..BF  
  U+10000..U+3FFFF      F0        90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

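Note the A0..BF in U+0800..U+0FFF, the 80..9F in U+D000..U+D7FF,
the 90..BF in U+10000..U+3FFFF, and the 80..8F in U+100000..U+10FFFF.
Those four restricted second-byte ranges are what exclude overlong
encodings, surrogates, and anything above U+10FFFF.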
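
As for the regexp itself, the table translates almost row for row into
an alternation.  What follows is only a sketch, not anything taken from
utf8.h: the names $utf8_char and is_well_formed_utf8 are made up for
illustration, and it assumes the string being tested holds raw octets
(no characters above 0xFF, i.e. not yet decoded).

```perl
use strict;
use warnings;

# One well-formed UTF-8 sequence; each alternative is one row of the table.
# $utf8_char is a hypothetical name, not something defined by Perl itself.
my $utf8_char = qr{
      [\x00-\x7F]                             # U+0000..U+007F
    | [\xC2-\xDF] [\x80-\xBF]                 # U+0080..U+07FF
    | \xE0        [\xA0-\xBF] [\x80-\xBF]     # U+0800..U+0FFF
    | [\xE1-\xEC] [\x80-\xBF]{2}              # U+1000..U+CFFF
    | \xED        [\x80-\x9F] [\x80-\xBF]     # U+D000..U+D7FF
    | [\xEE\xEF]  [\x80-\xBF]{2}              # U+E000..U+FFFF
    | \xF0        [\x90-\xBF] [\x80-\xBF]{2}  # U+10000..U+3FFFF
    | [\xF1-\xF3] [\x80-\xBF]{3}              # U+40000..U+FFFFF
    | \xF4        [\x80-\x8F] [\x80-\xBF]{2}  # U+100000..U+10FFFF
}x;

# Returns 1 if the whole octet string is a run of well-formed sequences
# (the empty string counts as well-formed), 0 otherwise.
sub is_well_formed_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /\A (?: $utf8_char )* \z/x ? 1 : 0;
}
```

With this, "\xC3\xA9" (U+00E9) passes, while the overlong "\xC0\x80" and
the surrogate "\xED\xA0\x80" are rejected, because C0 never appears as a
first byte and ED only allows 80..9F as its second byte.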

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen