Re: UTF-8 versions (was: Re: RFC 2047 and gatewaying)


Andrew Gierth wrote:

"Bruce" == Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:



I repeatedly asked you to be specific on the USEFOR list when you
started making claims about differing UTF-8 versions there; since
you declined to respond, I will deal with it here instead.


I did not decline; indeed in 
<3D8DF17D(_dot_)4040505(_at_)alex(_dot_)blilly(_dot_)com>
posted to the Usefor list Sun, 22 Sep 2002 12:36:13 -0400
I wrote:

"It should also be noted that there is not a single definition of
UTF-8. For example, the definition and implementation of UTF-8
described in Unicode <= 3.0 has no provision for Unicode values
 > 16 bits, so cannot represent the language tags. That means
that if a Unicode 3.1 application inserts such tags, which are
then transmitted as Unicode 3.1 UTF-8 to a Unicode 3.0 or earlier
platform, the latter will see "garbage characters" as described in
Unicode tr20. Example: English text "foo" tagged per Unicode 3.1:

U+E0001 U+E0065 U+E006E U+0066 U+006F U+006F U+E0001 U+E007F

Transformed to Unicode 3.1 UTF-8 (hex values for 8-bit codes):

F3 A0 80 81 F3 A0 81 A5 F3 A0 81 AE foo F3 A0 80 81 F3 A0 81 BF

The Unicode 2.0 UTF-8 reverse transformation yields:

U+DB40 U+DC01 U+DB40 U+DC65 U+DB40 U+DC6E U+0066 U+006F U+006F U+DB40 U+DC01
U+DB40 U+DC7F

That has 5 surrogate pairs which were not in the original
Unicode 3.1 text.

Some equivalent RFC 2047 codings of English "foo", suitable for use
in common headers (Subject, display name in From, Reply-To, etc.) are:

1. =?us-ascii*en?q?foo?=
2. =?utf-8?q?=F3=A0=80=81=F3=A0=81=A5=F3=A0=81=AEfoo=F3=A0=80=81=F3=A0=81=BF?=

Obviously the first is
a) much more compact
b) much easier for a reader with no Unicode or MIME support to
   determine the language
c) much easier for a reader with no Unicode or MIME support to read

Note also that the second just fits in the 75-character limit for an
encoded-word; if a text string longer than 3 characters were encoded,
or if a longer language tag were used, multiple encoded-words would be
required.  Note also that any header line containing an encoded-word
must be 76 characters or shorter (not including CRLF), so if the
second version were part of any header, it would have to appear alone
on a continuation line.  But the second form should never appear,
since Unicode 3.1 prohibits use of language tags in conjunction with
MIME or in the absence of ACAP."

That of course is not a parsing issue, but one of semantic interpretation
of the "utf-8" label.  According to one of the many "utf-8" specifications,
the "utf-8" stream encodes a language-tagged string, while in at least one
other "utf-8" specification it encodes something quite different.
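
For anyone who wants to recompute the figures in the quoted example, here
is a minimal Python sketch. It is only an illustration: the names TAGGED,
utf8_hex, utf16_units and q_encoded_word are hypothetical helpers of mine,
not part of any specification, and the Q-encoder deliberately escapes every
byte that is not an ASCII letter or digit, which is stricter than RFC 2047
requires.

```python
# Hypothetical helper names; none of this comes from a spec or library API.

# The tagged string from the example: LANGUAGE TAG, tag 'e', tag 'n',
# the plain text "foo", then LANGUAGE TAG and CANCEL TAG.
TAGGED = "\U000E0001\U000E0065\U000E006E" + "foo" + "\U000E0001\U000E007F"


def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


def utf16_units(text):
    """Return the UTF-16 code units of text as 'U+XXXX' strings.

    Code points above U+FFFF show up here as surrogate pairs.
    """
    raw = text.encode("utf-16-be")
    return [
        f"U+{int.from_bytes(raw[i:i + 2], 'big'):04X}"
        for i in range(0, len(raw), 2)
    ]


def q_encoded_word(text, charset="utf-8"):
    """Build an RFC 2047 style Q-encoded word for text.

    Assumption: every byte other than an ASCII letter or digit is written
    as =XX, which is always safe but not the shortest possible form.
    """
    body = "".join(
        chr(b) if b < 0x80 and chr(b).isalnum() else f"={b:02X}"
        for b in text.encode(charset)
    )
    return f"=?{charset}?q?{body}?="


if __name__ == "__main__":
    print(utf8_hex(TAGGED))
    print(" ".join(utf16_units(TAGGED)))
    word = q_encoded_word(TAGGED)
    print(word)
    print(len(word))  # 75, i.e. exactly at the encoded-word limit
```

Running it prints the UTF-8 byte sequence, the UTF-16 code units (where each
tag character appears as a surrogate pair), and the encoded-word together
with its length.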