Counting characters (was Re: matching multibyte utf-8 in perl)

On Friday 28 February 2003 09:03 pm, John Kilbourne wrote:

> I am a beginner as well, with the task of finding and counting the
> non-ascii characters in a utf-8 text. How do I do this?


That depends on what you want to accomplish. 

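Counting Unicode code points is easy. ASCII characters have the form 
0bbbbbbb in UTF-8. Initial bytes of non-ASCII character encodings have the 
form 11bbbbbb. All other bytes in UTF-8 streams (continuation bytes) have the 
form 10bbbbbb. So counting the bytes in the range 11000000-11111111 (hex 
C0-FF) will suffice, since each non-ASCII character contributes exactly one 
initial byte. In well-formed UTF-8 only C2-F4 actually occur as initial bytes.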
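As a minimal sketch of that byte test, written in Python rather than Perl 
purely for illustration: the function name and sample string below are made 
up for this example, and the code assumes the input is already well-formed 
UTF-8 (it does no validation).

```python
def count_non_ascii_chars(data: bytes) -> int:
    """Count non-ASCII characters in UTF-8 data by counting initial bytes.

    Assumes `data` is well-formed UTF-8; malformed input is not detected.
    """
    # Initial bytes of multibyte sequences have the form 11bbbbbb (>= 0xC0).
    # Continuation bytes (10bbbbbb, 0x80-0xBF) and ASCII (0bbbbbbb) are skipped.
    return sum(1 for b in data if b >= 0xC0)


# Hypothetical sample text: two accented Latin letters and three kanji.
sample = "naïve café, 日本語".encode("utf-8")
print(count_non_ascii_chars(sample))
```

Counting continuation bytes (hex 80-BF) instead would overcount, because a 
three-byte character contains two of them and a four-byte character three.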

If you want to count text characters while ignoring control characters and 
undefined code points, or to count base character + combining mark sequences 
as single characters, or to count glyphs in the rendering, you need to have a 
precise set of definitions suited to your application, and know a good deal 
about the details of Unicode.
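
To show how much the answer depends on the definition chosen, here is one 
possible set of rules sketched in Python with the standard unicodedata 
module. The function name, the parameter, and the rules themselves are 
assumptions made for this example: control characters (category Cc) and 
unassigned code points (Cn) are ignored, and combining marks (categories 
starting with M) are optionally folded into the preceding base character. 
This is a rough approximation, not full grapheme cluster segmentation, and 
it says nothing about glyph counts.

```python
import unicodedata


def count_text_chars(text: str, fold_marks: bool = True) -> int:
    """Count code points under one illustrative definition of 'character'."""
    count = 0
    for ch in text:
        category = unicodedata.category(ch)
        if category in ("Cc", "Cn"):
            # Skip control characters and unassigned code points.
            continue
        if fold_marks and category.startswith("M"):
            # Treat a combining mark as part of the preceding base character.
            continue
        count += 1
    return count


decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
print(count_text_chars(decomposed))                    # marks folded into base
print(count_text_chars(decomposed, fold_marks=False))  # every code point counted
```

Which categories to skip or fold is exactly the application-specific 
decision described above; a different application may need different rules.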
-- 
Edward Cherlin
Generalist & activist--Linux, languages, literacy and more
"A knot! Oh, do let me help to undo it!"
--Alice in Wonderland
