ietf-822

Re: RFC 2047 and gatewaying

2003-01-09 10:29:25

Charles Lindsey wrote:
> In <3E1731A6(_dot_)5030604(_at_)alex(_dot_)blilly(_dot_)com> Bruce Lilly <blilly(_at_)erols(_dot_)com> writes:
>
>> It's worse than that; there are at least 3 different versions of UTF-8.
>> They differ in the longer multi-byte sequences.
>
> No, there is precisely one, as defined by the relevant Unicode documents.
> See RFC 2044 and draft-yergeau-rfc2279bis-02.txt, with all of which Usefor
> is fully compatible.

Charles, in spite of having been shown the differences in the past, you persist
in claiming that there are none.  RFC 2044 is not a Unicode document, and has
long been obsoleted.  One clue that you are wrong is the following quotation
from Unicode Technical Report #28:

"Most notable among the corrigenda to the Standard is a further tightening of the 
definition of UTF-8, to eliminate irregular UTF-8 and to bring the Unicode specification 
of UTF-8 more completely into line with other specifications of UTF-8."

Obviously, if the Unicode Consortium states unequivocally that there are
multiple UTF-8 specifications which differ, there cannot be "precisely
one" UTF-8 specification.

Here, once again:

Unicode 2.0, table A-3 (applies through Unicode 3.0):

 Unicode Value     1st Byte  2nd Byte  3rd Byte  4th Byte
000000000xxxxxxx   0xxxxxxx
00000yyyyyxxxxxx   110yyyyy  10xxxxxx
zzzzyyyyyyxxxxxx   1110zzzz  10yyyyyy  10xxxxxx
110110wwwwzzzzyy +
110111yyyyxxxxxx   11110uuu  10uuzzzz  10yyyyyy  10xxxxxx
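
For illustration, the last row of that table can be sketched in Python as
follows.  This is only a sketch: the function name is made up for this
example, and it assumes the usual relation uuuuu = wwww + 1 between the bit
fields, which the table as quoted above does not spell out.

```python
def surrogate_pair_to_utf8(high: int, low: int) -> bytes:
    """Encode a UTF-16 surrogate pair as one 4-byte UTF-8 sequence."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Combine the two 10-bit payloads; adding 0x10000 is the
    # "uuuuu = wwww + 1" step (assumed relation, see the note above).
    cp = 0x10000 + ((high & 0x3FF) << 10) + (low & 0x3FF)
    return bytes([
        0xF0 | (cp >> 18),            # 11110uuu
        0x80 | ((cp >> 12) & 0x3F),   # 10uuzzzz
        0x80 | ((cp >> 6) & 0x3F),    # 10yyyyyy
        0x80 | (cp & 0x3F),           # 10xxxxxx
    ])
```

The point is that the pair is combined into a single 4-byte sequence, rather
than each surrogate being encoded separately with the 3-byte row.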

RFC 2044 is obsolete; here's the table from RFC 2279:

   UCS-4 range (hex.)           UTF-8 octet sequence (binary)
   0000 0000-0000 007F   0xxxxxxx
   0000 0080-0000 07FF   110xxxxx 10xxxxxx
   0000 0800-0000 FFFF   1110xxxx 10xxxxxx 10xxxxxx

   0001 0000-001F FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
   0020 0000-03FF FFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
   0400 0000-7FFF FFFF   1111110x 10xxxxxx ... 10xxxxxx
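
A minimal Python sketch of an encoder that follows the six rows of the
RFC 2279 table literally may make the 5- and 6-octet forms concrete.  The
function name is hypothetical, and the sketch implements only the table as
quoted here, not any further restrictions from the RFC text.

```python
def encode_rfc2279(value: int) -> bytes:
    """Encode a UCS-4 value using the six ranges of the RFC 2279 table."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("outside the UCS-4 range covered by the table")
    if value <= 0x7F:
        return bytes([value])
    # (upper limit of range, lead-octet marker, number of 10xxxxxx octets)
    ranges = [
        (0x7FF, 0xC0, 1),
        (0xFFFF, 0xE0, 2),
        (0x1FFFFF, 0xF0, 3),
        (0x3FFFFFF, 0xF8, 4),
        (0x7FFFFFFF, 0xFC, 5),
    ]
    for limit, marker, ntrail in ranges:
        if value <= limit:
            # The lead octet carries the most significant bits.
            lead = marker | (value >> (6 * ntrail))
            # Each trailing octet carries the next 6 bits, high to low.
            trail = [0x80 | ((value >> (6 * i)) & 0x3F)
                     for i in range(ntrail - 1, -1, -1)]
            return bytes([lead] + trail)
    raise AssertionError("unreachable")
```

With this sketch, any value from 0020 0000 upward produces a sequence of 5 or
6 octets.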

Unicode 3.2, Unicode Technical Report #28:

 Code Points         1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+CFFF       E1..EC    80..BF    80..BF
U+D000..U+D7FF       ED        80..9F    80..BF
U+D800..U+DFFF       ill-formed
U+E000..U+FFFF       EE..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF
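
The Unicode 3.2 table translates almost directly into a checker.  The
following Python sketch is illustrative only; the names WELL_FORMED and
is_well_formed_utf8 are invented for this example, and each entry simply
mirrors one row of the table (the ill-formed U+D800..U+DFFF row has no
entry, so those byte sequences are rejected).

```python
# (first-byte range, second-byte range or None, further 80..BF bytes)
WELL_FORMED = [
    ((0x00, 0x7F), None,         0),
    ((0xC2, 0xDF), (0x80, 0xBF), 0),
    ((0xE0, 0xE0), (0xA0, 0xBF), 1),
    ((0xE1, 0xEC), (0x80, 0xBF), 1),
    ((0xED, 0xED), (0x80, 0x9F), 1),
    ((0xEE, 0xEF), (0x80, 0xBF), 1),
    ((0xF0, 0xF0), (0x90, 0xBF), 2),
    ((0xF1, 0xF3), (0x80, 0xBF), 2),
    ((0xF4, 0xF4), (0x80, 0x8F), 2),
]

def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data matches the rows of the table above."""
    i = 0
    while i < len(data):
        for (lo1, hi1), second, extra in WELL_FORMED:
            if lo1 <= data[i] <= hi1:
                break
        else:
            return False  # C0, C1, F5..FF, or a stray 80..BF byte
        length = 1 + (1 if second else 0) + extra
        seq = data[i:i + length]
        if len(seq) < length:
            return False  # sequence is truncated
        if second and not second[0] <= seq[1] <= second[1]:
            return False  # second byte outside the row's range
        if any(not 0x80 <= b <= 0xBF for b in seq[2:]):
            return False  # later bytes must be 80..BF
        i += length
    return True
```

Under this table, a 5-byte sequence such as F8 88 80 80 80 and a 3-byte
encoding of a lone surrogate such as ED A0 80 are both rejected.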

Clearly RFC 2279 provides for 5- and 6-byte UTF-8 sequences, which are
not provided for by Unicode through 3.2.  And some 4-byte sequences
differ between Unicode versions (particularly those corresponding
to surrogate pairs).  Whether or not Unicode 4.0 and/or the draft
mentioned above will introduce additional variants is another matter.