At 19:39 2002-03-10 +0800, Autrijus Tang wrote:
[...] * Probe the first few thousand bytes of the file, looking for
[\xCD-\xFD][\x80-\xBF]. If that occurs, assume it to be UTF-8.
[...]
Or alternately, just don't commit on the UTF8/raw question until the first
highbit character(s) are seen -- since until then, the distinction is
purely academic.
BTW, what is a good regexp to match UTF8 bytes? Every time I look at RFC
2279 (or p47 of the Unicode Standard 3.0 book), I feel stupider and
stupider that it's not clearer to me (or alternately, angrier and angrier
that the spec-writers didn't make this clearer). In perlpodspec, I wrote:
<<
A naive but sufficient heuristic for testing the first highbit bytesequence
in a BOMless file (whether in code or in Pod!), to see whether that
sequence is valid as UTF8 (RFC 2279) is to check whether the first
byte in the sequence is in the range 0xC0-0xFD /and/ whether the next byte
is in the range 0x80-0xBF. If so, the parser may conclude that this file is
in UTF8, and all highbit sequences in the file should be assumed to be
UTF8. Otherwise the parser should treat the file as being in Latin1.
>>
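For concreteness, here is a minimal sketch of that heuristic, written in Python rather than Perl purely for illustration. The function name guess_encoding and its return strings are hypothetical, not taken from perlpodspec or any real Pod parser. It also covers the "don't commit until the first highbit byte" idea: a file with no highbit bytes comes back undecided, reported here as "ascii".

```python
import re

# Matches the first byte with the high bit set (0x80-0xFF).
FIRST_HIGHBIT = re.compile(rb"[\x80-\xFF]")

def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: apply the naive perlpodspec-style check.

    Returns "ascii" if no highbit byte is seen (nothing to decide yet),
    "utf-8" if the first highbit byte is in 0xC0-0xFD and the byte after
    it is in 0x80-0xBF, and "latin-1" otherwise.
    """
    m = FIRST_HIGHBIT.search(data)
    if m is None:
        return "ascii"  # no highbit bytes: the distinction is academic
    i = m.start()
    lead = data[i]
    has_next = i + 1 < len(data)
    if 0xC0 <= lead <= 0xFD and has_next and 0x80 <= data[i + 1] <= 0xBF:
        return "utf-8"
    return "latin-1"
```

Only the first highbit sequence is inspected, as the quoted text describes, so a Latin1 file that happens to begin with a byte pair such as 0xC3 0xA9 would be misread as UTF8.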
I don't know where I got C0 from, versus your CD.
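On the regexp question, one possible pattern simply follows the lead-byte table in RFC 2279: a lead byte in 0xC0-0xDF takes one continuation byte, 0xE0-0xEF takes two, 0xF0-0xF7 three, 0xF8-0xFB four, and 0xFC-0xFD five, with every continuation byte in 0x80-0xBF. The sketch below is Python, not Perl, and the names UTF8_SEQ and looks_like_utf8 are made up for illustration. It is a structural check only: it does not reject overlong forms (for instance 0xC0 0x80), so treat it as a starting point rather than a validator.

```python
import re

# One multi-byte sequence, by RFC 2279 lead-byte range.
UTF8_SEQ = re.compile(
    rb"""
      [\xC0-\xDF][\x80-\xBF]       # 2-byte sequence
    | [\xE0-\xEF][\x80-\xBF]{2}    # 3-byte sequence
    | [\xF0-\xF7][\x80-\xBF]{3}    # 4-byte sequence
    | [\xF8-\xFB][\x80-\xBF]{4}    # 5-byte sequence
    | [\xFC-\xFD][\x80-\xBF]{5}    # 6-byte sequence
    """,
    re.VERBOSE,
)

def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if every highbit byte in data belongs
    to a structurally well-formed RFC 2279 sequence."""
    pos = 0
    while pos < len(data):
        if data[pos] < 0x80:
            pos += 1  # plain ASCII byte, nothing to check
            continue
        m = UTF8_SEQ.match(data, pos)
        if m is None:
            return False  # stray continuation byte, 0xFE/0xFF, or truncated sequence
        pos = m.end()
    return True
```

Unlike the first-sequence heuristic, this walks the whole buffer, so a single stray Latin1 byte anywhere makes it return False.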
--
Sean M. Burke sburke(_at_)cpan(_dot_)org http://www.spinn.net/~sburke/