ietf-822

Re: UTF-8 versions (was: Re: RFC 2047 and gatewaying)

2003-01-10 11:21:48

Andrew Gierth wrote:
"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:

 Bruce>  From the point of view of parsing some stream of octets,
 Bruce> according to one "utf-8" specification a certain sequence *is*
 Bruce> a utf-8 sequence, and according to other "utf-8"
 Bruce> specifications it is *not* a utf-8 sequence.

If instead you ask the question "is this a sequence of Unicode (or
ISO10646) characters encoded in UTF-8", then the difference between
versions disappears. (Given a 5-byte sequence, you can either reject
it as not in the spec, or decode it to the appropriate codepoint and
then reject _that_ as being out of range.)

That's an irrelevant question to the issue of parsing a stream of
octets.  You're talking about semantic interpretation of what has
presumably already been parsed.

The distinction is solely in whether you regard utf-8 as a
representation of Unicode characters, or as a way to encode arbitrary
31-bit values.

That too is irrelevant; for parsing an octet stream, what is
represented by the stream (again, semantic interpretation rather
than syntactic) is not of concern.

 Bruce> I stand by my assertion that (from the aforementioned point of
 Bruce> view) there are multiple incompatible "utf-8" specifications.

You are splitting a theoretical hair that is irrelevant in practice.

It is quite relevant to parsing, which precedes semantic interpretation.
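
To make the disagreement concrete, here is a minimal Python sketch of the two stages. It is illustrative only: the names decode_one and is_unicode_char are hypothetical, and the parameter max_len is an assumption standing in for "which utf-8 specification": 6 for a definition that encodes 31-bit values, 4 for a definition restricted to code points up to U+10FFFF. Checks for overlong forms and surrogates are deliberately left out to keep it short.

```python
# Lead-octet patterns: (mask, tag, sequence length).
LEAD_PATTERNS = [
    (0xE0, 0xC0, 2),
    (0xF0, 0xE0, 3),
    (0xF8, 0xF0, 4),
    (0xFC, 0xF8, 5),
    (0xFE, 0xFC, 6),
]


def decode_one(octets, max_len):
    """Parse one sequence from the start of `octets`.

    `max_len` is the longest sequence the chosen grammar allows
    (an assumption of this sketch: 6 or 4).
    Returns (value, length) or raises ValueError on a syntax error.
    """
    b0 = octets[0]
    if b0 < 0x80:
        return b0, 1  # single-octet (US-ASCII) case
    for mask, tag, length in LEAD_PATTERNS:
        if b0 & mask == tag:
            break
    else:
        raise ValueError("invalid lead octet")
    if length > max_len:
        # Syntactic rejection: this grammar has no such sequence.
        raise ValueError("sequence length not in this grammar")
    if len(octets) < length:
        raise ValueError("truncated sequence")
    value = b0 & ~mask & 0xFF  # payload bits of the lead octet
    for b in octets[1:length]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation octet")
        value = (value << 6) | (b & 0x3F)
    return value, length


def is_unicode_char(octets):
    """Semantic check: parse with the 6-octet grammar, then range-check."""
    value, _ = decode_one(octets, max_len=6)
    return value <= 0x10FFFF


five = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

# 6-octet grammar: the sequence parses; the value is out of Unicode range.
print(decode_one(five, max_len=6))
print(is_unicode_char(five))

# 4-octet grammar: the same octets are rejected by the parser itself.
try:
    decode_one(five, max_len=4)
except ValueError as err:
    print("syntax error:", err)
```

Both routes end in rejecting the 5-octet input, which is the first point above; they differ in whether the parser or the later range check does the rejecting, which is the second.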