Keith,
Although well intentioned, you are pursuing a question that is
fundamentally irrelevant to an IETF technical effort.
The IETF does not produce technical specifications that use heuristics.
A specification needs to provide a precise (and efficient) means of
distinguishing different data.
The line of questioning you are pursuing rests on data analysis that is
either a heuristic or requires a lexical analyzer with unbounded (or at
least far too much) memory.
Given the difficulty of getting clear, precise and workable transition
and operations analysis from the proponents of "just use UTF-8" and
given their failure to attend to the extended discussion of such a model
10 years ago, pursuing your question provides them encouragement where
none should exist.
d/
Thursday, January 16, 2003, 7:19:47 AM, you wrote:
Keith> people often claim that it's okay to use utf-8 without tagging in contexts
Keith> where other charsets may also appear, because sufficiently long strings of
Keith> utf-8 can be distinguished (more or less reliably) from other charsets by
Keith> checking to see if the string is valid utf-8.
Keith> question:
Keith> in an environment where either utf-8 or gb18030 may appear, how reliably
Keith> can gb18030 and utf-8 strings be identified and
Keith> distinguished from one another?
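
For illustration only, the validity check described in the quoted text can be
sketched as below. This is a minimal sketch, not a proposed mechanism: the
helper names decodes_as and candidate_charsets are hypothetical, and it assumes
Python's built-in "utf-8" and "gb18030" codecs in strict mode. It suggests that
some byte strings are well-formed under both charsets, so a validity check
alone can only narrow the candidates, not decide between them.

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a well-formed byte sequence in `encoding`."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def candidate_charsets(data: bytes) -> set:
    """Return the subset of {utf-8, gb18030} under which `data` is valid.

    Hypothetical helper: an empty set means neither charset fits, and a
    two-element set means the validity check cannot tell them apart.
    """
    return {enc for enc in ("utf-8", "gb18030") if decodes_as(data, enc)}


# Pure ASCII is valid in both charsets.
ascii_only = b"abc"

# U+00E9 encoded as UTF-8 is the bytes C3 A9, which also form a
# well-formed two-byte GB18030 sequence.
ambiguous = "\u00e9".encode("utf-8")

# U+4E2D encoded as GB18030 is D6 D0; D0 is not a UTF-8 continuation byte.
gb_only = "\u4e2d".encode("gb18030")

# U+4E2D encoded as UTF-8 is E4 B8 AD; read as GB18030 the final byte is
# a lead byte with no trail byte, so strict decoding fails.
utf8_only = "\u4e2d".encode("utf-8")

for sample in (ascii_only, ambiguous, gb_only, utf8_only):
    print(sample, sorted(candidate_charsets(sample)))
```

A two-element result for a non-ASCII input is the ambiguous case: the check
says nothing about which charset was intended.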
d/
--
Dave <mailto:dcrocker(_at_)brandenburg(_dot_)com>
Brandenburg InternetWorking <http://www.brandenburg.com>
t +1.408.246.8253; f +1.408.850.1850