perl-unicode

Re: Automagical :text layer (was: My favorite bug to fix for 5.8.0)

2002-03-10 16:25:31
At 19:39 2002-03-10 +0800, Autrijus Tang wrote:
[...] * Probe the first few thousand bytes of the file, looking for
   [\xCD-\xFD][\x80-\xBF]. If that occurs, assume it to be UTF-8.
[...]

Or alternately, just don't commit on the UTF8/raw question until the first highbit character(s) are seen -- since until then, the distinction is purely academic.

BTW, what is a good regexp to match UTF8 bytes? Every time I look at RFC 2279 (or p47 of the Unicode Standard 3.0 book), I feel stupider and stupider that it's not clearer to me (or alternately, angrier and angrier that the spec-writers didn't make this clearer). In perlpodspec, I wrote:

<<
A naive but sufficient heuristic for testing the first highbit byte-sequence in a BOM-less file (whether in code or in Pod!), to see whether that sequence is valid as UTF8 (RFC 2279) is to check whether the first byte in the sequence is in the range 0xC0-0xFD /and/ whether the next byte is in the range 0x80-0xBF. If so, the parser may conclude that this file is in UTF8, and all highbit sequences in the file should be assumed to be UTF8. Otherwise the parser should treat the file as being in Latin1.
>>
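
A minimal sketch of that heuristic, in Python rather than Perl purely for illustration (the character classes carry over to a Perl regexp unchanged). The name guess_encoding and the "ascii" return value are assumptions made here, not anything from perlpodspec. It also defers the decision until the first highbit byte is seen, as suggested above.

```python
import re

# Any byte with the high bit set.
FIRST_HIGHBIT = re.compile(rb"[\x80-\xFF]")
# The naive test: a lead byte in 0xC0-0xFD followed by one
# continuation byte in 0x80-0xBF.
NAIVE_UTF8 = re.compile(rb"[\xC0-\xFD][\x80-\xBF]")


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: return 'ascii', 'utf-8' or 'latin-1'."""
    first = FIRST_HIGHBIT.search(data)
    if first is None:
        # No highbit bytes at all, so there is nothing to decide yet.
        return "ascii"
    # Only the first highbit sequence is examined, as in the heuristic.
    if NAIVE_UTF8.match(data, first.start()):
        return "utf-8"
    return "latin-1"
```

Being naive, this can misjudge Latin1 text whose first highbit bytes happen to look like a UTF8 pair (for example 0xC3 followed by 0xA9), and it does not validate anything after the first two bytes of the sequence.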

I don't know where I got C0 from, versus your CD.
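
One possible way to compare the two lower bounds is to look at the lead-byte bit patterns in RFC 2279: 0xC0 is the lowest byte of the form 110xxxxx, so it is the lowest byte that looks like the start of a multi-byte sequence. The sketch below (Python, with hypothetical helper names) tabulates the sequence length each lead byte implies, and shows that lead bytes 0xC0 and 0xC1 can only produce values below 0x80, which already have a one-byte encoding.

```python
def sequence_length(lead: int) -> int:
    """Sequence length implied by a lead byte under RFC 2279, or 0 if
    the byte cannot start a sequence."""
    if lead < 0x80:
        return 1              # 0xxxxxxx: plain ASCII
    if 0xC0 <= lead <= 0xDF:
        return 2              # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3              # 1110xxxx
    if 0xF0 <= lead <= 0xF7:
        return 4              # 11110xxx
    if 0xF8 <= lead <= 0xFB:
        return 5              # 111110xx
    if 0xFC <= lead <= 0xFD:
        return 6              # 1111110x
    return 0                  # 0x80-0xBF continuation bytes, 0xFE, 0xFF


def two_byte_value(lead: int, cont: int) -> int:
    """Value encoded by a two-byte sequence (no validity checks)."""
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


# Largest value reachable with lead byte 0xC1, and smallest with 0xC2.
largest_with_c1 = two_byte_value(0xC1, 0xBF)
smallest_with_c2 = two_byte_value(0xC2, 0x80)
```

On this reading, 0xC0 matches the bit pattern, while a stricter check that rejects non-shortest forms would start the two-byte range at 0xC2. A bound of 0xCD does not correspond to either boundary.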


--
Sean M. Burke    sburke(_at_)cpan(_dot_)org    http://www.spinn.net/~sburke/