Andrew Gierth wrote:
"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:
that was hardly an answer to my question of Sep 30th which was part
of a different branch of the discussion.
An example of a difference serves as that answer, and I was not inclined to
waste bandwidth by repeating the same example within such a short span of time.
In any event, you now have it again.
Bruce> U+E0001 U+E0065 U+E006E U+0066 U+006F U+006F U+E0001 U+E007F
Bruce> Transformed to Unicode 3.1 UTF-8 (hex values for 8-bit codes):
Bruce> F3 A0 80 81 F3 A0 81 A5 F3 A0 81 AE foo F3 A0 80 81 F3 A0 81 BF
Bruce> The Unicode 2.0 UTF-8 reverse transformation yields:
Bruce> U+DB40 U+DC01 U+DB40 U+DC65 U+DB40 U+DC6E U+0066 U+006F U+006F
Bruce> U+DB40 U+DC01 U+DB40 U+DC7F
Bruce> That has 5 surrogate pairs which were not in the original
Bruce> Unicode 3.1 text.
but those surrogate pairs are precisely the ones that represent the
original codepoints. i.e. if you merely converted the original
character sequence into UTF-16, you would get the same result.
The fact that those codepoints may not be known to the receiving
application is to do with the difference in the Unicode specification
itself, and nothing to do with UTF-8.
It has to do with the difference in the two utf-8 specifications, specifically
for 4-byte sequences, which in one case (but not the other) involve surrogate
pairs. The original had 8 "characters" (actually at most 5 could really be
called characters in any meaningful sense; the others are syntactic glue),
whereas the result has 13 16-bit "characters", 10 of which are "garbage".
One "utf-8" encoding which transforms between uniform-width 16-bit "characters"
and a sequence of bytes cannot be said to be "the same as" another "utf-8"
specification which transforms between variable-width "characters" and a
sequence of bytes; they are different because the domain of one side of the
transformation differs. Now you're saying that with yet another transformation
one can get from a to c by way of an intermediate q (UTF-16), but that is not
the same as saying that a->b (Unicode 3.1+ to "utf-8") and b->c ("utf-8" to
Unicode 3.0 and earlier) are parts of the "same" transformation. The "utf-8"
transformation could be said to be "the same" in that case if and only if
a == c, but since a and c are quite different, the two "utf-8" transformations
are necessarily different.
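To make the difference concrete, here is a rough Python sketch (the bit
arithmetic follows the published UTF-8 layouts; the function names are mine,
not from any standard) showing that the same 4-byte sequence yields one 21-bit
codepoint under the Unicode 3.1 rules, while a 16-bit (Unicode 2.0-era)
implementation must represent that value as a surrogate pair:

```python
# One 4-byte UTF-8 sequence, two interpretations.

def decode_4byte(b):
    """Unicode 3.1 rules: a 4-byte sequence encodes one scalar value
    from the 3 + 6 + 6 + 6 payload bits."""
    return (((b[0] & 0x07) << 18) | ((b[1] & 0x3F) << 12)
            | ((b[2] & 0x3F) << 6) | (b[3] & 0x3F))

def to_16bit_pair(cp):
    """A 16-bit implementation can only hold the same value as two
    16-bit surrogate code units."""
    v = cp - 0x10000
    return (0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF))

seq = bytes([0xF3, 0xA0, 0x80, 0x81])  # first group from the example above
cp = decode_4byte(seq)                 # one codepoint, 0xE0001 (LANGUAGE TAG)
pair = to_16bit_pair(cp)               # two 16-bit "characters"
```

The surrogate values this produces agree with what Python's own utf-16 codec
emits for the same codepoint, which is one way to sanity-check the arithmetic.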
[snip language-tagging stuff irrelevant to this issue]
The language tagging is relevant for a number of reasons:
1. it bears on the disadvantages of untagged utf-8 (vs. properly
RFC 2047/2231 tagged utf-8)
2. it is an example of the differences in the utf-8 specifications
(since the 4-byte sequences which differ are involved)
3. it illustrates the horribly baroque Unicode 3.x encoding of the
   characters making up ISO 639 language tags. One objection voiced
   against RFC 2047 has been that it uses some additional octets
   compared to raw 8-bit data -- Unicode 3.x language tags are much
   worse, as (via utf-8) they transform the 7-bit characters which
   comprise ISO 639 language tags into long sequences of octets.
4. If there is a proposal to use untagged utf-8 instead of properly
   tagged (RFC 2047 / 2231) charsets (including but not limited to
   utf-8), then the entirety of the repercussions of such a proposal
   ought to be considered. Language tagging has long been part of
   MIME, but it is a recent addition to Unicode, one that is
   incompatible among Unicode versions and also (per the Unicode
   standards themselves) incompatible with MIME.
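Point 3 above is easy to quantify. A small Python sketch (the constants follow
the published Plane 14 tag-character layout, where each ASCII tag character c
is carried as U+E0000 + c; the variable names are mine) shows the octet cost
of tagging the two-letter ISO 639 tag "en":

```python
# Plane 14 tag characters: U+E0001 is LANGUAGE TAG; each ASCII tag
# character c is carried as the codepoint 0xE0000 + ord(c).
plain = b"en"                           # the raw ISO 639 tag: 2 octets
tagged = "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in "en")
encoded = tagged.encode("utf-8")        # every tag character becomes a
                                        # 4-byte UTF-8 sequence
# 2 octets of 7-bit text become 12 octets of tag-character utf-8
```

So even before considering version incompatibilities, the tag mechanism
multiplies the octet count of the 7-bit tag text sixfold.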
Bruce> That of course is not a parsing issue, but one of semantic
Bruce> interpretation of the "utf-8". According to one of the many
Bruce> "utf-8" specifications, the "utf-8" stream encodes a
Bruce> language-tagged string, while in at least one other "utf-8"
Bruce> specification it encodes something quite different.
it encodes the exact same sequence of codepoints regardless of
specification. The fact that some of those codepoints are not defined
in earlier unicode versions is nothing to do with the use of UTF-8.
It is related to the different utf-8 specifications inasmuch as the
different specifications in conjunction with the change from a 16-bit
pre-3.1 character width and the newly-introduced codepoints (including
language tags) together result in obfuscated content with a different
number of bits. It's difficult to cleanly separate those issues, since
if musical notes, the baroque encoding of language tags, etc. hadn't been
added (deviating from the original Unicode Design Principles), the
character width wouldn't have had to be increased beyond 16 bits and the
utf-8 4-byte sequences wouldn't need to be specified differently. It
is precisely because the Unicode versions are different that the "utf-8"
transformation specifications are necessarily different.
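The arithmetic behind the example at the top of this message can be checked
with Python's own codecs (a sketch, using the standard utf-8 and utf-16-be
encoders as stand-ins for the Unicode 3.1 and 16-bit views respectively):

```python
# The 8 codepoints from the example: LANGUAGE TAG, the "en" tag
# characters, the literal text "foo", LANGUAGE TAG, and CANCEL TAG.
text = "\U000E0001\U000E0065\U000E006E" + "foo" + "\U000E0001\U000E007F"

scalars = len(text)                        # 8 codepoints under Unicode 3.1
utf8 = text.encode("utf-8")                # 5 * 4 + 3 = 23 octets
units = len(text.encode("utf-16-be")) // 2 # 13 16-bit units: 5 surrogate
                                           # pairs plus f, o, o
```

Eight scalar values on one side, thirteen 16-bit units on the other: the
domains of the two "utf-8" transformations are demonstrably not the same.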