Andrew Gierth wrote:
"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:
that was hardly an answer to my question of Sep 30th which was part
of a different branch of the discussion.
An example of a difference serves as that answer, and I was not inclined to
waste bandwidth by repeating the same example within such a short span of time.
In any event, you now have it again.
Bruce> U+E0001 U+E0065 U+E006E U+0066 U+006F U+006F U+E0001 U+E007F
Bruce> Transformed to Unicode 3.1 UTF-8 (hex values for 8-bit codes):
Bruce> F3 A0 80 81 F3 A0 81 A5 F3 A0 81 AE foo F3 A0 80 81 F3 A0 81 BF
Bruce> The Unicode 2.0 UTF-8 reverse transformation yields:
Bruce> U+DB40 U+DC01 U+DB40 U+DC65 U+DB40 U+DC6E U+0066 U+006F U+006F
Bruce> U+DB40 U+DC01 U+DB40 U+DC7F
Bruce> That has 5 surrogate pairs which were not in the original
Bruce> Unicode 3.1 text.
but those surrogate pairs are precisely the ones that represent the
original codepoints. i.e. if you merely converted the original
character sequence into UTF-16, you would get the same result.
The fact that those codepoints may not be known to the receiving
application is to do with the difference in the Unicode specification
itself, and nothing to do with UTF-8.
It has to do with the difference in the two utf-8 specifications, specifically
for 4-byte sequences, which in one case (but not the other) involve surrogate
pairs. The original had 8 "characters" (actually at most 5 could really be
called characters in any meaningful sense; the others are syntactic glue),
whereas the result has 13 16-bit "characters", 10 of which are "garbage".
One "utf-8" encoding which transforms between uniform-width 16-bit "characters"
and a sequence of bytes cannot be said to be "the same as" another "utf-8"
specification which transforms between variable-width "characters" and a
sequence of bytes; they are different because the domain of one side of the
transformation differs. Now you're saying that with yet another transformation
one can get from a to c by way of an intermediate q (UTF-16), but that is not
the same as saying that a->b (Unicode 3.1+ to "utf-8") and b->c ("utf-8" to
Unicode 3.0 and earlier) are parts of the "same" transformation. The "utf-8"
transformation could be said to be "the same" in that case if and only if
a == c, but since a and c are quite different, the two "utf-8" transformations
are necessarily different.
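To make the difference concrete, here is a rough Python sketch (the bit
arithmetic follows the published UTF-8 layouts; the function names are mine,
not from any standard) showing that the same 4-byte sequence yields one 21-bit
codepoint under the Unicode 3.1 rules, while a 16-bit (Unicode 2.0-era)
implementation must represent that value as a surrogate pair:

```python
# One 4-byte UTF-8 sequence, two interpretations.

def decode_4byte(b):
    """Unicode 3.1 rules: a 4-byte sequence encodes one scalar value
    from the 3 + 6 + 6 + 6 payload bits."""
    return (((b[0] & 0x07) << 18) | ((b[1] & 0x3F) << 12)
            | ((b[2] & 0x3F) << 6) | (b[3] & 0x3F))

def to_16bit_pair(cp):
    """A 16-bit implementation can only hold the same value as two
    16-bit surrogate code units."""
    v = cp - 0x10000
    return (0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF))

seq = bytes([0xF3, 0xA0, 0x80, 0x81])  # first group from the example above
cp = decode_4byte(seq)                 # one codepoint, 0xE0001 (LANGUAGE TAG)
pair = to_16bit_pair(cp)               # two 16-bit "characters"
```

The surrogate values this produces agree with what Python's own utf-16 codec
emits for the same codepoint, which is one way to sanity-check the arithmetic.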
[snip language-tagging stuff irrelevant to this issue]
The language tagging is relevant for a number of reasons:
1. it bears on the disadvantages of untagged utf-8 (vs. properly
RFC 2047/2231 tagged utf-8)
2. it is an example of the differences in the utf-8 specifications
(since the 4-byte sequences which differ are involved)
3. it illustrates the horribly baroque Unicode 3.x encoding of the
   characters making up ISO 639 language tags. One objection voiced
   against RFC 2047 has been that it uses some additional octets
   compared to raw 8-bit data -- Unicode 3.x language tags are much
   worse, as (via utf-8) they transform the 7-bit characters which
   comprise ISO 639 language tags into long sequences of octets.
4. If there is a proposal to use untagged utf-8 instead of properly
   tagged (RFC 2047 / 2231) charsets (including but not limited to
   utf-8), then the entirety of the repercussions of such a proposal
   ought to be considered. Language tagging has long been part of
   MIME, but it is a recent addition to Unicode, one that is
   incompatible among Unicode versions and also (per the Unicode
   standards themselves) incompatible with MIME.
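Point 3 above is easy to quantify. A small Python sketch (the constants follow
the published Plane 14 tag-character layout, where each ASCII tag character c
is carried as U+E0000 + c; the variable names are mine) shows the octet cost
of tagging the two-letter ISO 639 tag "en":

```python
# Plane 14 tag characters: U+E0001 is LANGUAGE TAG; each ASCII tag
# character c is carried as the codepoint 0xE0000 + ord(c).
plain = b"en"                           # the raw ISO 639 tag: 2 octets
tagged = "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in "en")
encoded = tagged.encode("utf-8")        # every tag character becomes a
                                        # 4-byte UTF-8 sequence
# 2 octets of 7-bit text become 12 octets of tag-character utf-8
```

So even before considering version incompatibilities, the tag mechanism
multiplies the octet count of the 7-bit tag text sixfold.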
Bruce> That of course is not a parsing issue, but one of semantic
Bruce> interpretation of the "utf-8". According to one of the many
Bruce> "utf-8" specifications, the "utf-8" stream encodes a
Bruce> language-tagged string, while in at least one other "utf-8"
Bruce> specification it encodes something quite different.
it encodes the exact same sequence of codepoints regardless of
specification. The fact that some of those codepoints are not defined
in earlier unicode versions is nothing to do with the use of UTF-8.
It is related to the different utf-8 specifications inasmuch as the
different specifications in conjunction with the change from a 16-bit
pre-3.1 character width and the newly-introduced codepoints (including
language tags) together result in obfuscated content with a different
number of bits. It's difficult to cleanly separate those issues, since
if musical notes, the baroque encoding of language tags, etc. hadn't been
added (deviating from the original Unicode Design Principles), the
character width wouldn't have had to be increased beyond 16 bits and the
utf-8 4-byte sequences wouldn't need to be specified differently. It
is precisely because the Unicode versions are different that the "utf-8"
transformation specifications are necessarily different.
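The arithmetic behind the example at the top of this message can be checked
with Python's own codecs (a sketch, using the standard utf-8 and utf-16-be
encoders as stand-ins for the Unicode 3.1 and 16-bit views respectively):

```python
# The 8 codepoints from the example: LANGUAGE TAG, the "en" tag
# characters, the literal text "foo", LANGUAGE TAG, and CANCEL TAG.
text = "\U000E0001\U000E0065\U000E006E" + "foo" + "\U000E0001\U000E007F"

scalars = len(text)                        # 8 codepoints under Unicode 3.1
utf8 = text.encode("utf-8")                # 5 * 4 + 3 = 23 octets
units = len(text.encode("utf-16-be")) // 2 # 13 16-bit units: 5 surrogate
                                           # pairs plus f, o, o
```

Eight scalar values on one side, thirteen 16-bit units on the other: the
domains of the two "utf-8" transformations are demonstrably not the same.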