ietf-822

Re: distinguishing between utf-8 and gb18030

2003-01-17 14:52:29

In <20030116101947(_dot_)32be02ad(_dot_)moore(_at_)cs(_dot_)utk(_dot_)edu> 
Keith Moore <moore(_at_)cs(_dot_)utk(_dot_)edu> writes:

> question:
>
> in an environment where either utf-8 or gb18030 may appear, how reliably
> can gb18030 and utf-8 strings be identified and distinguished from one another?
> offhand it appears that many gb18030 strings are also valid utf-8 strings.

Andrew Gierth did some experiments on this, using a whole week's worth of
news on supernews. It was found to be far more reliable than any of us had
imagined. I forget the exact false positive rate, but it was exceedingly
low. It did not distinguish the various gb*s from each other or from other
non-UTF-8, of course.
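
For illustration only, here is a minimal sketch of the kind of check being
discussed: treat a byte string as UTF-8 if it decodes under strict UTF-8
rules, and as "something else" (possibly one of the gb*s) otherwise. This is
not Andrew's actual test, whose details I do not have; the function name
classify_bytes and its three labels are made up for the example.

```python
def classify_bytes(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for a byte string.

    Hypothetical helper: 'other' covers every non-UTF-8 charset,
    including GB18030, which this check cannot tell apart.
    """
    # Pure 7-bit data is identical in ASCII, UTF-8 and GB18030.
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        # Strict decoding rejects stray lead bytes, bad continuation
        # bytes, overlong forms and truncated sequences.
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


if __name__ == "__main__":
    text = "中文"
    print(classify_bytes(text.encode("utf-8")))
    print(classify_bytes(text.encode("gb18030")))
    # A short GB18030 sequence that also happens to be valid UTF-8,
    # so it would be misread as UTF-8 by this check.
    print(classify_bytes(b"\xc2\xa1"))
```

The last case shows the kind of false positive Keith is asking about: the two
bytes C2 A1 are a legal GB18030 character and also a legal UTF-8 sequence.
The UTF-8 continuation-byte rules make such coincidences less likely as the
non-ASCII text gets longer.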

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
