Re: UTF-8 versions (was: Re: RFC 2047 and gatewaying)

"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:

That is because Unicode has no codepoints above U+10FFFF, so such
sequences could never be generated. I understand that ISO 10646
likewise has no assigned codepoints above U+10FFFF, and that none
will be assigned in future.


 Bruce>  From the point of view of parsing some stream of octets,
 Bruce> according to one "utf-8" specification a certain sequence *is*
 Bruce> a utf-8 sequence, and according to other "utf-8"
 Bruce> specifications it is *not* a utf-8 sequence.

If instead you ask the question "is this a sequence of Unicode (or
ISO 10646) characters encoded in UTF-8", then the difference between
versions disappears. (Given a 5-byte sequence, you can either reject
it as not in the spec, or decode it to the appropriate codepoint and
then reject _that_ as being out of range.)
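
To make the parenthetical concrete, here is a rough Python sketch of
the two strategies. All names in it (decode_one_legacy, accept_strict,
accept_decode_then_check) are invented for illustration and come from
no library. It assumes the older 1-to-6 octet layout for the decoder,
and it deliberately skips overlong-form and surrogate checks, so treat
it as an illustration of the argument rather than a usable validator.

```python
MAX_CODEPOINT = 0x10FFFF  # highest Unicode / ISO 10646 codepoint


def decode_one_legacy(octets):
    """Decode a single sequence using the older 1-to-6 octet layout.

    Hypothetical helper: no overlong-form or surrogate checks are done.
    """
    first = octets[0]
    if first < 0x80:
        return first
    # The number of leading 1 bits in the first octet is the length.
    length = 0
    mask = 0x80
    while first & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 6 or len(octets) != length:
        raise ValueError("malformed sequence")
    # The bits below the length marker are the top payload bits.
    value = first & (mask - 1)
    for octet in octets[1:]:
        if octet & 0xC0 != 0x80:
            raise ValueError("bad continuation octet")
        value = (value << 6) | (octet & 0x3F)
    return value


def accept_strict(octets):
    """Strategy 1: anything longer than 4 octets is not in the spec."""
    if len(octets) > 4:
        return False
    try:
        return decode_one_legacy(octets) <= MAX_CODEPOINT
    except ValueError:
        return False


def accept_decode_then_check(octets):
    """Strategy 2: decode up to 31 bits, then reject out-of-range values."""
    try:
        return decode_one_legacy(octets) <= MAX_CODEPOINT
    except ValueError:
        return False


# A 5-octet sequence; it decodes to 0x200000, which is above U+10FFFF.
five_octets = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])
```

Both accept_strict(five_octets) and accept_decode_then_check(five_octets)
return False, while both accept an ordinary sequence such as
"é".encode("utf-8"). The two readings differ only in where the
rejection happens, not in which inputs are accepted as Unicode text.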

The distinction is solely in whether you regard utf-8 as a
representation of Unicode characters, or as a way to encode arbitrary
31-bit values.

 Bruce> I stand by my assertion that (from the aforementioned point of
 Bruce> view) there are multiple incompatible "utf-8" specifications.

You are splitting a theoretical hair that is irrelevant in practice.

-- 
Andrew.
