Re: Automagical :text layer (was: My favorite bug to fix for 5.8.0)

At 19:39 2002-03-10 +0800, Autrijus Tang wrote:

[...] * Probe the first few thousand bytes of the file, looking for
   [\xCD-\xFD][\x80-\xBF]. If that occurs, assume it to be UTF-8.
[...]

Or alternately, just don't commit on the UTF8/raw question until the first highbit character(s) are seen -- since until then, the distinction is purely academic.

BTW, what is a good regexp to match UTF8 bytes? Every time I look at RFC 2279 (or p47 of the Unicode Standard 3.0 book), I feel stupider and stupider that it's not clearer to me (or alternately, angrier and angrier that the spec-writers didn't make this clearer). In perlpodspec, I wrote:

<<

A naive but sufficient heuristic for testing the first highbit bytesequence in a BOMless file (whether in code or in Pod!), to see whether that sequence is valid as UTF8 (RFC 2279) is to check whether that the first byte in the sequence is in the range 0xC0-0xFD /and/ whether the next byte is in the range 0x80-0xBF. If so, the parser may conclude that this file is in UTF8, and all highbit sequences in the file should be assumed to be UTF8. Otherwise the parser should treat the file as being in Latin1.

>>

I don't know where I got C0 from, versus your CD.
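
For what it's worth, here is a rough sketch of that heuristic, written in Python purely for illustration. The function name guess_encoding and the "undecided" return value are assumptions of this sketch, not anything perlpodspec defines. It uses the 0xC0-0xFD lead-byte range from the quoted text, and it stays uncommitted when no highbit byte shows up at all.

```python
import re

# Any byte with the high bit set.
_HIGHBIT = re.compile(rb"[\x80-\xFF]")

def guess_encoding(data: bytes) -> str:
    """Guess from the first highbit byte sequence only (hypothetical helper)."""
    m = _HIGHBIT.search(data)
    if m is None:
        # No highbit bytes seen yet: UTF8 vs. raw is still academic.
        return "undecided"
    i = m.start()
    lead = data[i]
    nxt = data[i + 1] if i + 1 < len(data) else None
    # Lead byte in 0xC0-0xFD, followed by a continuation byte in 0x80-0xBF.
    if 0xC0 <= lead <= 0xFD and nxt is not None and 0x80 <= nxt <= 0xBF:
        return "utf-8"
    return "latin-1"
```

So guess_encoding("café".encode("utf-8")) comes back as "utf-8", while the same string encoded as Latin1 comes back as "latin-1". This only looks at the first highbit sequence and does not validate the rest of the input.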


--
Sean M. Burke    sburke(_at_)cpan(_dot_)org    http://www.spinn.net/~sburke/
