Re: distinguishing between utf-8 and gb18030


In <20030116101947(_dot_)32be02ad(_dot_)moore(_at_)cs(_dot_)utk(_dot_)edu> 
Keith Moore <moore(_at_)cs(_dot_)utk(_dot_)edu> writes:

question:

in an environment where either utf-8 or gb18030 may appear, how reliably
can gb18030 and utf-8 strings be identified and distinguished from one another?
offhand it appears that many gb18030 strings are also valid utf-8 strings.


Andrew Gierth did some experiments on this, using a whole week's worth of
news on supernews. It was found to be far more reliable than any of us had
imagined. I forget the exact false positive rate, but it was exceedingly
low. It did not distinguish the various gb*s from each other or from other
non-UTF-8, of course.
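
For illustration only, here is a minimal sketch of the kind of check I have in mind: treat a string as UTF-8 if and only if it contains non-ASCII bytes and decodes as strict UTF-8. This is not Andrew's actual test code, which I do not have to hand. The function name `classify` and its return labels are my own invention, and the use of Python's built-in "gb18030" codec as the fallback is an assumption.

```python
def classify(data: bytes) -> str:
    """Guess the encoding of a byte string (hypothetical helper)."""
    # Pure ASCII is valid in both encodings, so there is nothing to decide.
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass

    # Strict UTF-8 validation: lead bytes must be followed by the right
    # number of continuation bytes (0x80-0xBF), with no overlong forms.
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Not UTF-8. Check whether it is at least well-formed GB18030; this
    # cannot tell GB18030 apart from GBK, GB2312 or other legacy charsets.
    try:
        data.decode("gb18030", errors="strict")
        return "gb18030"
    except UnicodeDecodeError:
        return "unknown"


if __name__ == "__main__":
    print(classify("中文".encode("utf-8")))    # typical UTF-8 text
    print(classify("中文".encode("gb18030")))  # typical GB18030 text
    print(classify(b"\xc2\xa1"))               # well-formed in both encodings
```

The last case is the sort of false positive Keith is worried about: a two-byte sequence that is well-formed in both encodings gets reported as UTF-8. The longer the string, the less likely it is that every GB18030 byte pair happens to line up with the UTF-8 rules.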

-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
