Re: Encode-0.40.tar.gz

This reminds me: we should probably have some user-accessible method of
detection of the Unicode encodings, too: UTF-8 (well, this is really
guessing, at least without a BOM, "does this look like valid UTF-8 to
you"), but BOMs and UTF-16-foo, and UTF-32-foo.


Actually, UTF-8 autodetection can be pretty reliable (though I wouldn't
recommend it for applications). The chances that a string in another
encoding is completely free of malformed or overlong UTF-8 sequences are
pretty small. UTF-8 has enough unique syntactic rigor to make it quite
easily distinguishable from any other encoding.
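
To illustrate the point, here is a minimal sketch of such a check in
Python (not taken from Encode; the function name looks_like_utf8 is
made up for this example). It relies only on the fact that Python's
strict UTF-8 decoder rejects malformed and overlong sequences:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8.

    Hypothetical helper for illustration. This is still a guess:
    pure ASCII passes too, since ASCII is a subset of UTF-8.
    """
    try:
        # The strict decoder raises on malformed, truncated and
        # overlong sequences, and on encoded surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

A Latin-1 string with even one non-ASCII character will usually fail
this test, because a lone high byte is rarely followed by the right
number of continuation bytes.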

There's a detailed recommendation on BOM handling by encoding converters
in

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#ucsutf

which I hope you will find helpful.
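
For the detection side of the question, BOM sniffing itself can be
sketched roughly as follows. This is only an illustration in Python,
not a summary of the recommendation at the URL above; the names
sniff_bom and _BOMS are invented for the example, while the BOM
constants come from Python's standard codecs module:

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM.

    Hypothetical helper. Note the inherent ambiguity: UTF-16-LE text
    whose first character after the BOM is U+0000 is reported as
    UTF-32-LE.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would typically strip bom_length bytes before decoding, and
fall back to some other guess when no BOM is found.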

There are further recommendations for the authors of Unicode encoding
converters in

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#conv

especially on how to use field 5 in the Unicode database and the Unihan
database to construct correct Unicode-to-somethingelse mapping tables.
These sections summarise the intensive past discussions on these issues
on the linux-utf8 mailing list. I hope that you are already familiar
with them. If not, please read and consider the above sections
carefully. Thanks!

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>
