In <20030116101947(_dot_)32be02ad(_dot_)moore(_at_)cs(_dot_)utk(_dot_)edu>
Keith Moore <moore(_at_)cs(_dot_)utk(_dot_)edu> writes:
> question:
> in an environment where either utf-8 or gb18030 may appear, how reliably
> can gb18030 and utf-8 strings be identified and distinguished from one another?
> offhand it appears that many gb18030 strings are also valid utf-8 strings.
Andrew Gierth did some experiments on this, using a whole week's worth of
news on supernews. It was found to be far more reliable than any of us had
imagined. I forget the exact false positive rate, but it was exceedingly
low. It did not distinguish the various gb*s from each other or from other
non-UTF-8, of course.
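
For illustration only, here is a minimal sketch of the kind of check under
discussion, assuming the test is simply "does the byte string decode as strict
UTF-8?". The name classify is hypothetical, and this is not necessarily how
the experiment mentioned above was carried out.

```python
def classify(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for a raw byte string.

    'other' means "not valid UTF-8"; as noted above, this check cannot
    tell the various gb* encodings apart from each other or from any
    other non-UTF-8 charset.
    """
    # Pure 7-bit data is valid in both UTF-8 and GB18030, so it carries
    # no evidence either way.
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms, and encoded surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "other"
    return "utf-8"


if __name__ == "__main__":
    samples = {
        "plain ASCII": b"hello",
        "UTF-8 Chinese": "中文".encode("utf-8"),
        "GB18030 Chinese": "中文".encode("gb18030"),
        # Valid UTF-8 (U+00A1) and also a legal GB18030 two-byte
        # sequence, so this one is a potential false positive.
        "ambiguous pair": b"\xc2\xa1",
    }
    for label, raw in samples.items():
        print(f"{label}: {classify(raw)}")
```

A GB18030 string is misread as UTF-8 only when every one of its non-ASCII
byte runs happens to form a well-formed UTF-8 sequence. Short strings such as
the two-byte example can do so, but each additional multi-byte character is
another chance for the decode to fail.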
--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5