"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:
that is because Unicode does not have any codepoints above
U+10FFFF, and therefore such sequences could not be generated. I
understand that there are also no assigned ISO 10646 codepoints
above U+10FFFF, and also that there will be no future assignments of
such.
Bruce> From the point of view of parsing some stream of octets,
Bruce> according to one "utf-8" specification a certain sequence *is*
Bruce> a utf-8 sequence, and according to other "utf-8"
Bruce> specifications it is *not* a utf-8 sequence.
If instead you ask the question "is this a sequence of Unicode (or
ISO 10646) characters encoded in UTF-8", then the difference between
versions disappears. (Given a 5-byte sequence, you can either reject
it as not in the spec, or decode it to the appropriate codepoint and
then reject _that_ as being out of range.)
The distinction is solely in whether you regard utf-8 as a
representation of Unicode characters, or as a way to encode arbitrary
31-bit values.
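
To make the two options concrete, here is a minimal sketch in Python. It
is an illustration only: the names decode_31bit, valid_by_syntax and
valid_by_range are made up for this example, and the hand-written decoder
assumes the older 1-to-6-byte layout and does not check for overlong
forms or surrogates.

```python
MAX_UNICODE = 0x10FFFF  # highest codepoint in Unicode / ISO 10646


def decode_31bit(seq: bytes) -> int:
    """Decode ONE sequence in the older 1-to-6-byte form (up to 31 bits).

    Illustrative only: overlong forms and surrogates are not rejected.
    """
    if not seq:
        raise ValueError("empty sequence")
    lead = seq[0]
    if lead < 0x80:
        length, value = 1, lead
    elif 0xC0 <= lead < 0xE0:
        length, value = 2, lead & 0x1F
    elif 0xE0 <= lead < 0xF0:
        length, value = 3, lead & 0x0F
    elif 0xF0 <= lead < 0xF8:
        length, value = 4, lead & 0x07
    elif 0xF8 <= lead < 0xFC:
        length, value = 5, lead & 0x03
    elif 0xFC <= lead < 0xFE:
        length, value = 6, lead & 0x01
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (b & 0x3F)  # append 6 payload bits
    return value


def valid_by_syntax(seq: bytes) -> bool:
    """Option 1: reject the byte sequence itself as not in the spec.

    Uses Python's built-in strict UTF-8 codec, which refuses 5- and
    6-byte sequences outright.
    """
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def valid_by_range(seq: bytes) -> bool:
    """Option 2: decode to a codepoint, then reject it if out of range."""
    try:
        return decode_31bit(seq) <= MAX_UNICODE
    except ValueError:
        return False


five_byte = b"\xf8\x88\x80\x80\x80"   # decodes to 0x200000 in the old form
last_valid = b"\xf4\x8f\xbf\xbf"      # U+10FFFF
print(valid_by_syntax(five_byte), valid_by_range(five_byte))
print(valid_by_syntax(last_valid), valid_by_range(last_valid))
```

Both checks refuse the 5-byte sequence and both accept U+10FFFF; they
differ only in the stage at which the rejection happens.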
Bruce> I stand by my assertion that (from the aforementioned point of
Bruce> view) there are multiple incompatible "utf-8" specifications.
You are splitting a theoretical hair that is irrelevant in practice.
--
Andrew.