Andrew Gierth wrote:
"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:
Bruce> From the point of view of parsing some stream of octets,
Bruce> according to one "utf-8" specification a certain sequence *is*
Bruce> a utf-8 sequence, and according to other "utf-8"
Bruce> specifications it is *not* a utf-8 sequence.
If instead you ask the question "is this a sequence of Unicode (or
ISO10646) characters encoded in UTF-8", then the difference between
versions disappears. (Given a 5-byte sequence, you can either reject
it as not in the spec, or decode it to the appropriate codepoint and
then reject _that_ as being out of range.)
That's an irrelevant question to the issue of parsing a stream of
octets. You're talking about semantic interpretation of what has
presumably already been parsed.
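
To make the two positions concrete, here is a minimal sketch in Python. It is an illustration only, not taken from any specification text: the names decode_one, accept_strict and accept_decode_then_range_check are hypothetical, the max_len parameter stands in for "which utf-8 grammar is in force" (4 octets versus the older 6-octet form), and overlong-form and surrogate checks are deliberately omitted to keep it short.

```python
MAX_UNICODE = 0x10FFFF  # upper bound of the Unicode code space

# (mask, pattern) pairs for lead bytes of 2..6 octet sequences.
_LEAD_PATTERNS = [(0xE0, 0xC0), (0xF0, 0xE0), (0xF8, 0xF0),
                  (0xFC, 0xF8), (0xFE, 0xFC)]


def lead_length(byte, max_len):
    """Length implied by a lead byte, or None if the grammar forbids it."""
    if byte < 0x80:
        return 1
    for length, (mask, pattern) in enumerate(_LEAD_PATTERNS, start=2):
        if byte & mask == pattern:
            return length if length <= max_len else None
    return None  # continuation byte, 0xFE or 0xFF


def decode_one(octets, max_len):
    """Decode exactly one sequence; None means 'not a sequence' (syntax)."""
    if not octets:
        return None
    length = lead_length(octets[0], max_len)
    if length is None or len(octets) != length:
        return None
    if length == 1:
        return octets[0]
    value = octets[0] & (0x7F >> length)  # payload bits of the lead byte
    for b in octets[1:]:
        if b & 0xC0 != 0x80:  # every trailing octet must be 10xxxxxx
            return None
        value = (value << 6) | (b & 0x3F)
    return value


def accept_strict(octets):
    """4-octet grammar: a 5-octet lead byte is already a parse error."""
    cp = decode_one(octets, 4)
    return cp is not None and cp <= MAX_UNICODE


def accept_decode_then_range_check(octets):
    """6-octet grammar: parse first, then reject out-of-range values."""
    cp = decode_one(octets, 6)
    return cp is not None and cp <= MAX_UNICODE


five = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])
print(decode_one(five, 4), decode_one(five, 6))
print(accept_strict(five), accept_decode_then_range_check(five))
```

Under these assumptions both accept functions give the same final answer for the 5-octet input, which is the first point made above. The parse step itself differs, though: with max_len=4 the octets are not a sequence at all, while with max_len=6 they parse to a value that is only rejected afterwards, which is the distinction the reply draws.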
The distinction is solely in whether you regard utf-8 as a
representation of Unicode characters, or as a way to encode arbitrary
31-bit values.
That too is irrelevant; for parsing an octet stream, what is
represented by the stream (again, semantic interpretation rather
than syntactic) is not of concern.
Bruce> I stand by my assertion that (from the aforementioned point of
Bruce> view) there are multiple incompatible "utf-8" specifications.
You are splitting a theoretical hair that is irrelevant in practice.
It is quite relevant to parsing, which precedes semantic interpretation.