ietf-822

distinguishing between utf-8 and gb18030

2003-01-16 08:23:06

people often claim that it's okay to use utf-8 without tagging in contexts
where other charsets may also appear, because sufficiently long strings of
utf-8 can be distinguished (more or less reliably) from other charsets by
checking to see if the string is valid utf-8.
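
as a minimal sketch of the check being described (python is my choice here, and the helper name is_valid_utf8 is hypothetical), the test amounts to attempting a strict decode and seeing whether it fails:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        # Strict mode raises on any malformed or truncated sequence.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    text = "naïve café"
    print(is_valid_utf8(text.encode("utf-8")))    # UTF-8 bytes pass the check
    print(is_valid_utf8(text.encode("latin-1")))  # ISO-8859-1 bytes fail it
```

note that passing this check only shows the bytes are consistent with utf-8; it does not show that utf-8 is what the sender intended.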

question:

in an environment where either utf-8 or gb18030 may appear, how reliably
can gb18030 and utf-8 strings be identified and distinguished from one another?
offhand it appears that many gb18030 strings are also valid utf-8 strings.
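
to illustrate the overlap, here is a rough sketch that tries both decoders on the same bytes and reports which ones accept them. it assumes python's built-in "utf-8" and "gb18030" codecs; the function names are hypothetical, and successful decoding is only a validity test, not a judgment about whether the result is plausible text.

```python
def decodes_as(data: bytes, charset: str) -> bool:
    """Return True if the bytes decode without error in the given charset."""
    try:
        data.decode(charset, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def candidate_charsets(data: bytes, charsets=("utf-8", "gb18030")) -> list:
    """List every charset under which the bytes are well-formed."""
    return [name for name in charsets if decodes_as(data, name)]


if __name__ == "__main__":
    samples = {
        "utf-8 'é' (C3 A9)": b"\xc3\xa9",
        "gb18030 '中文'": "中文".encode("gb18030"),
        "utf-8 '中'": "中".encode("utf-8"),
    }
    for label, raw in samples.items():
        print(label, "->", candidate_charsets(raw))
```

the first sample is accepted by both decoders: C3 A9 is a two-byte utf-8 sequence, and it is also a legal gb18030 lead/trail byte pair. for strings like that, validity alone cannot settle the question.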

Keith