Re: UTF-8 versions (was: Re: RFC 2047 and gatewaying)


Andrew Gierth wrote:

"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:

 Bruce>  From the point of view of parsing some stream of octets,
 Bruce> according to one "utf-8" specification a certain sequence *is*
 Bruce> a utf-8 sequence, and according to other "utf-8"
 Bruce> specifications it is *not* a utf-8 sequence.

If instead you ask the question "is this a sequence of Unicode (or
ISO10646) characters encoded in UTF-8", then the difference between
versions disappears. (Given a 5-byte sequence, you can either reject
it as not in the spec, or decode it to the appropriate codepoint and
then reject _that_ as being out of range.)


That's an irrelevant question to the issue of parsing a stream of
octets.  You're talking about semantic interpretation of what has
presumably already been parsed.

The distinction is solely in whether you regard utf-8 as a
representation of Unicode characters, or as a way to encode arbitrary
31-bit values.


That too is irrelevant; for parsing an octet stream, what is
represented by the stream (again, semantic interpretation rather
than syntactic) is not of concern.

 Bruce> I stand by my assertion that (from the aforementioned point of
 Bruce> view) there are multiple incompatible "utf-8" specifications.

You are splitting a theoretical hair that is irrelevant in practice.


It is quite relevant to parsing, which precedes semantic interpretation.
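
To make the parsing point concrete, here is a minimal sketch in Python.
It assumes the two definitions at issue are an older one that allows
sequences of up to six octets (31-bit values, as in RFC 2279) and a
later one that stops at four octets (as in RFC 3629); this message does
not name the documents, so that pairing is an assumption. The names
sequence_length, decode_one, in_unicode_range and the max_len parameter
are hypothetical, and overlong forms are deliberately not checked.

```python
def sequence_length(lead, max_len):
    """Length implied by a lead octet, or None if it cannot start a sequence.

    max_len is a hypothetical knob: 6 for the older 31-bit definition,
    4 for the later definition limited to four octets.
    """
    if lead < 0x80:
        return 1
    if lead < 0xC0:
        return None  # a continuation octet cannot start a sequence
    for length, limit in ((2, 0xE0), (3, 0xF0), (4, 0xF8), (5, 0xFC), (6, 0xFE)):
        if lead < limit:
            return length if length <= max_len else None
    return None  # 0xFE and 0xFF are never lead octets


def decode_one(octets, max_len):
    """Syntactic step: parse the first sequence into an integer, or None.

    Simplification: overlong encodings are not rejected here.
    """
    if not octets:
        return None
    length = sequence_length(octets[0], max_len)
    if length is None or len(octets) < length:
        return None
    if length == 1:
        return octets[0]
    value = octets[0] & (0x7F >> length)  # payload bits of the lead octet
    for b in octets[1:length]:
        if b & 0xC0 != 0x80:
            return None  # not a continuation octet
        value = (value << 6) | (b & 0x3F)
    return value


def in_unicode_range(value):
    """Semantic step: is the decoded integer a usable Unicode code point?"""
    return 0 <= value <= 0x10FFFF and not (0xD800 <= value <= 0xDFFF)


five = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])  # a 5-octet sequence
old_view = decode_one(five, max_len=6)  # parses; rejected later by range check
new_view = decode_one(five, max_len=4)  # rejected by the parser itself
```

With max_len=6 the five octets parse as one sequence and only
in_unicode_range() rejects the result; with max_len=4 the parser refuses
the lead octet before any interpretation takes place. Both routes end in
rejection, but they differ in where the octet stream is judged malformed,
which is the syntactic difference I am referring to.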
