ietf-822

Re: distinguishing between utf-8 and gb18030

2003-01-16 08:46:04

Keith,

Although well intentioned, you are pursuing a question that is
fundamentally irrelevant to an IETF technical effort.

The IETF does not produce technical specifications that use heuristics.
A specification needs to provide a precise (and efficient) means of
distinguishing different data.

The line of questioning you are pursuing relies on data analysis that
either is a heuristic or requires a lexical analyzer with unbounded (or
at least far too much) memory.

Given the difficulty of getting clear, precise and workable transition
and operations analysis from the proponents of "just use UTF-8" and
given their failure to attend to the extended discussion of such a model
10 years ago, pursuing your question provides them encouragement where
none should exist.

d/

Thursday, January 16, 2003, 7:19:47 AM, you wrote:

Keith> people often claim that it's okay to use utf-8 without tagging in
Keith> contexts where other charsets may also appear, because sufficiently
Keith> long strings of utf-8 can be distinguished (more or less reliably)
Keith> from other charsets by checking to see if the string is valid utf-8.

Keith> question:

Keith> in an environment where either utf-8 or gb18030 may appear, how reliably
Keith> can gb18030 and utf-8 strings be identified and
Keith> distinguished from one another?
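
The validity check described in the quoted question can be sketched as
below. This is only an illustration, assuming Python's built-in "utf-8"
and "gb18030" codecs with strict error handling; the function name
candidate_charsets is hypothetical and appears nowhere in the thread. It
reports which of the two charsets a byte string decodes cleanly under,
which is the kind of heuristic at issue, not a precise discriminator.

```python
from typing import List


def candidate_charsets(data: bytes) -> List[str]:
    """Return the charsets (of utf-8 and gb18030) under which data is valid.

    Hypothetical helper for illustration. A result with two entries means
    validity checking alone cannot tell the two charsets apart.
    """
    candidates = []
    for name in ("utf-8", "gb18030"):
        try:
            data.decode(name)  # strict mode: raises on any invalid sequence
        except UnicodeDecodeError:
            continue
        candidates.append(name)
    return candidates


if __name__ == "__main__":
    samples = {
        "ASCII only": b"hello",
        "U+4E2D encoded as gb18030": "\u4e2d".encode("gb18030"),
        "U+20AC encoded as utf-8": "\u20ac".encode("utf-8"),
        "U+00E9 encoded as utf-8": "\u00e9".encode("utf-8"),
    }
    for label, raw in samples.items():
        print(label, raw, candidate_charsets(raw))
```

Short strings are where the check is weakest: the two bytes C3 A9 are
valid utf-8 (U+00E9) and also a valid two-byte gb18030 sequence, so the
sketch reports both charsets for them.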


d/
-- 
 Dave <mailto:dcrocker(_at_)brandenburg(_dot_)com>
 Brandenburg InternetWorking <http://www.brandenburg.com>
 t +1.408.246.8253; f +1.408.850.1850