ietf-822

Re: distinguishing between utf-8 and gb18030

2003-01-16 09:19:55


> people often claim that it's okay to use utf-8 without tagging in contexts
> where other charsets may also appear, because sufficiently long strings of
> utf-8 can be distinguished (more or less reliably) from other charsets by
> checking to see if the string is valid utf-8.
>
> question:
>
> in an environment where either utf-8 or gb18030 may appear, how reliably
> can gb18030 and utf-8 strings be identified and distinguished from one
> another? offhand it appears that many gb18030 strings are also valid
> utf-8 strings.

Most of the time it should be easy. Both GB18030 and UTF-8 are
ASCII-compatible, but the similarities stop there. Most Chinese characters
in GB18030 encode as two bytes above 0x7F, just as in GB2312 or GBK. The
rest of Unicode maps to a set of four-byte sequences. While a GB18030
string can happen to align with valid UTF-8, it isn't very likely, and the
likelihood drops dramatically the more data you have.
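
As a concrete illustration of that structural test, here is a minimal
sketch using Python's built-in utf-8 and gb18030 codecs (the function
names and the decision order are my own; this is the validity check
described above, not a production detector):

    def looks_like_utf8(data):
        """True if data decodes cleanly as UTF-8."""
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    def guess_charset(data):
        # Pure ASCII is valid in both encodings, so it proves nothing.
        if all(b < 0x80 for b in data):
            return "us-ascii"
        # Non-ASCII data that is valid UTF-8 is very unlikely to be
        # GB18030 by accident, and the odds drop as the data grows.
        if looks_like_utf8(data):
            return "utf-8"
        # Not UTF-8; check whether it is at least well-formed GB18030.
        try:
            data.decode("gb18030")
            return "gb18030"
        except UnicodeDecodeError:
            return "unknown"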

Distinguishing GB18030 from other multibyte encodings of Chinese, Japanese,
or Korean is much trickier, however. You can certainly tell it apart from
the various ISO-2022-derived encodings with ease (and those encodings can
be distinguished from each other fairly easily too), but the byte-level
structure of some of these multibyte CJK encodings is very similar. That
changes the problem from looking for valid byte sequences in a particular
encoding to having to use character-frequency information and the like.
Not only is this much more difficult, requiring large and complex tables,
it also needs much more data before it can make a reliable guess.
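
The ISO-2022 case is easy precisely because those encodings stay in 7-bit
space and announce character-set switches with escape sequences. A toy
heuristic (my own sketch, again in Python):

    ESC = 0x1b

    def looks_like_iso2022(data):
        """Heuristic: 7-bit-only data that contains an ISO-2022
        designation escape such as ESC $ B (JIS X 0208) or
        ESC ( B (ASCII)."""
        seen_designation = False
        for i, b in enumerate(data):
            if b >= 0x80:           # any 8-bit byte rules it out
                return False
            if b == ESC and data[i + 1:i + 2] in (b"$", b"("):
                seen_designation = True
        return seen_designation

Telling the 8-bit multibyte CJK encodings apart from each other is where
the frequency tables come in, and no short sketch does that justice.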

If you actually used GB18030 to encode text written in a script other than
Roman or Kanji, you'd have something that might be recognizable as uniquely
GB18030.
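
For instance, any character GBK has no code for, including everything
outside the BMP, forces GB18030 into its four-byte form, and runs of those
sequences are a strong fingerprint (again using Python's gb18030 codec;
the sample character is just an illustration):

    # U+1D11E (MUSICAL SYMBOL G CLEF) is outside the BMP, so GBK has no
    # two-byte code for it and GB18030 must use a four-byte sequence.
    encoded = "\U0001d11e".encode("gb18030")
    assert len(encoded) == 4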

There are companies that specialize in selling software to do this sort of
thing, you know. I haven't dealt with such software myself but I know it
exists.

                                Ned