Re: latest draft - content-transfer-encoding

Walt Daniels <dan(_at_)watson(_dot_)ibm(_dot_)com> writes:

I am unhappy with the selection of content-transfer-encodings.  When (If)
ISO 10646 DIS 2 passes and 8-bit transport is available, then UTF will be a
popular content-transfer-encoding.


Nathaniel Borenstein <nsb(_at_)thumper(_dot_)bellcore(_dot_)com> writes:

However, I think I'm missing something.  Why do you need to use UTF as
an encoding?  Why can't the spec for ISO 10646 as a character set simply
say that the "raw" data is UTF-encoded 10646?  If the only use for an
encoding is to encode a specific type, then it isn't a
content-transfer-encoding, it's simply the chosen representation format
for that specific type.


Greg Vaudreuil <gvaudre(_at_)nri(_dot_)reston(_dot_)va(_dot_)us> writes:

UTF is not strictly speaking a transfer encoding.  It is in its
native form 8 bits, and needs to be encoded into 7 bits to be usable
via SMTP.  UTF is defined as a transport-like encoding of ISO 10646; it
would likely be indicated in MIME as a character set (a profile of
ISO 10646).


I'm sorry I haven't brought up the following points earlier but
this seems to be a good opportunity.

Both Base64 and Quoted-Printable will be bad content-transfer
encodings for UTF text.  They will work, of course, but at the
cost of an unreasonable (and technically unnecessary) increase
in the number of bytes that have to be transported by SMTP.
This will not affect Americans or Europeans, whose text consists
almost entirely of the letters A to Z, which are represented by
one byte each in UTF.  It will affect the poorer parts of the
world, where, in addition, bandwidth is relatively much more
costly than for us.  Most of these countries or cultures will
have no alternative standard 8-bit charset value to use in MIME
messages either.

The cause of this is that in UTF all alphabetic characters of
the 16-bit form of ISO 10646 (except the ASCII characters) are
represented by either two "high" bytes (ASCII value > 127) or
one high and one low byte.  This means that the punishment for
writing text consisting of non-Latin letters will be on average
a FIVEFOLD increase of the number of bytes to send, compared to
a text consisting of the same number of ASCII letters.  This is
if the Quoted-Printable encoding is used.  With the Base64
encoding the punishment factor will "only" be 2.7.

Since different languages use parts of the 10646 coding space
with different ratios of "high-high" UTF representations to
"high-low" representations, the punishment factor for
Quoted-Printable encoding will vary between 4 and 6.
(See my more detailed analysis below.)
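The per-letter costs quoted above can be checked against real
encoders.  What follows is only a rough sketch using the Python
standard library; the octet values in HH and HL are arbitrary
illustrations of the two shapes, and qp_cost and base64_factor
are names made up for this example.  Line-break overhead is
ignored.

```python
import base64
import quopri

# Two-octet UTF forms of a non-ASCII letter (illustrative octet values only).
HH = bytes([0xA5, 0xC1])   # high octet followed by high octet
HL = bytes([0xA5, 0x59])   # high octet followed by low octet ("Y")


def qp_cost(pair: bytes) -> int:
    """Octets sent for one letter under Quoted-Printable."""
    # Each high octet becomes "=XX" (three octets); a printable low octet
    # passes through unchanged.
    return len(quopri.encodestring(pair))


def base64_factor(pair: bytes, letters: int = 300) -> float:
    """Average octets sent per letter under Base64, ignoring line breaks."""
    return len(base64.b64encode(pair * letters)) / letters


print("Q-P, high-high:", qp_cost(HH))
print("Q-P, high-low: ", qp_cost(HL))
print("Base64:        ", round(base64_factor(HH), 1))
```

A text that mixes both shapes therefore lands somewhere between
4 and 6 octets per letter with Quoted-Printable, while Base64
stays at 8/3 regardless of the mix.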

The conclusion of this seems to be that Base64 should be
preferred to Quoted-Printable for a future charset=UTF.  But at
the same time it is easy to envision a third encoding, let's
call it "Quoted-Low", that is easier to implement than Base64
and still gives a further reduction of the overhead for most
non-Latin scripts, for some scripts down to punishment factor
2.0.  It could work in this way:

1. Represent high octets with the corresponding low octet (except
   those corresponding to the ASCII characters CR, LF, "#", and
   "$").
2. Represent low octets (except CR and LF) by prefixing with "$".
3. Use these special representations:
   LF        ->  LF
   octet 138 -> "#J"
   CR        ->  CR
   octet 141 -> "#M"
   octet 163 -> "##"
   octet 164 -> "#$"

This presupposes that in practice ASCII control characters can
be safely sent by SMTP.  If that does not hold, the set of
special representations can be enlarged to cover the most easily
damaged control characters (NUL and ESC?).
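The three rules above can be sketched as a small codec.  This is
only an illustration of the proposal as listed, not a tested
implementation: the function names are invented here, and it
makes the same assumption as the text, namely that control
characters survive transport.

```python
CR, LF = 0x0D, 0x0A
HASH, DOLLAR = ord("#"), ord("$")

# Rule 3: high octets whose low counterpart is LF, CR, "#" or "$".
SPECIAL_HIGH = {0x8A: b"#J", 0x8D: b"#M", 0xA3: b"##", 0xA4: b"#$"}
# Reverse lookup, keyed by the octet that follows "#".
ESCAPED_HIGH = {seq[1]: octet for octet, seq in SPECIAL_HIGH.items()}


def quoted_low_encode(data: bytes) -> bytes:
    """Encode arbitrary octets into 7-bit 'Quoted-Low' (hypothetical name)."""
    out = bytearray()
    for octet in data:
        if octet in (CR, LF):
            out.append(octet)                # rule 3: CR and LF stay as they are
        elif octet in SPECIAL_HIGH:
            out += SPECIAL_HIGH[octet]       # rule 3: "#"-escaped high octets
        elif octet >= 0x80:
            out.append(octet - 0x80)         # rule 1: high octet -> low octet
        else:
            out += bytes([DOLLAR, octet])    # rule 2: low octet -> "$" + octet
    return bytes(out)


def quoted_low_decode(data: bytes) -> bytes:
    """Invert quoted_low_encode."""
    out = bytearray()
    i = 0
    while i < len(data):
        octet = data[i]
        if octet in (HASH, DOLLAR):
            if i + 1 >= len(data):
                raise ValueError("truncated escape sequence")
            following = data[i + 1]
            if octet == DOLLAR:
                out.append(following)        # literal low octet
            elif following in ESCAPED_HIGH:
                out.append(ESCAPED_HIGH[following])
            else:
                raise ValueError("unknown '#' escape")
            i += 2
        elif octet in (CR, LF):
            out.append(octet)
            i += 1
        else:
            out.append(octet + 0x80)         # plain octet stands for a high octet
            i += 1
    return bytes(out)


sample = bytes([0xA5, 0xC1, 0xA5, 0x59])     # one high-high and one high-low pair
encoded = quoted_low_encode(sample)
print(encoded, len(encoded))
assert quoted_low_decode(encoded) == sample
```

A high-high pair costs 2 octets and a high-low pair costs 3,
which is where the punishment factors between 2.0 and 3.0 come
from.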

===

COMPARISON OF THREE (OR FOUR) POSSIBLE
CONTENT-TRANSFER-ENCODINGS FOR DIFFERENT SCRIPTS

This is the result of an analysis of ISO/IEC DIS 10646-1.2
I have made:

Script  No. of (small) letters  Punishment factor
------  ----------------------  -----------------
              HL    HH          Q-P    Q-Low   (2.7 for Base64)
              --    --          ---    -----   (2.0 for 8bit)

Greek          7    14          5.4    2.3
Cyrillic      32     0          4.0    3.0
Armenian      19    19          5.0    2.5
Hebrew         2    25          5.8    2.1
Arabic        19    17          4.9    2.5
Devanagari    44    37          4.9    2.5
Thai               all          6.0    2.0
Georgian      36     3          4.2    2.9
Hiragana      21    63          5.5    2.2
Katakana      61    29          4.6    2.7
Bopomofo           all          6.0    2.0

"HL" means the UTF representation is one high octet followed by
one low octet, "HH" means it is two high octets.
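The factors in the table follow from a simple weighted average
over the HL and HH counts.  Here is a short sketch of that
arithmetic, assuming the per-letter costs described earlier
(Q-P: 4 octets for HL and 6 for HH; Q-Low: 3 for HL and 2 for
HH) and that all letters are equally frequent.  The function
name is made up for illustration.

```python
def punishment(hl: int, hh: int) -> tuple[float, float]:
    """Return (Q-P factor, Q-Low factor) for a script's letter counts."""
    total = hl + hh
    qp = (4 * hl + 6 * hh) / total      # "=XX" for every high octet
    qlow = (3 * hl + 2 * hh) / total    # "$" prefix for every low octet
    return qp, qlow


# (HL, HH) counts copied from the table above.
COUNTS = {
    "Greek": (7, 14),
    "Cyrillic": (32, 0),
    "Armenian": (19, 19),
    "Hebrew": (2, 25),
    "Arabic": (19, 17),
    "Devanagari": (44, 37),
    "Georgian": (36, 3),
    "Hiragana": (21, 63),
    "Katakana": (61, 29),
}

for script, (hl, hh) in COUNTS.items():
    qp, qlow = punishment(hl, hh)
    print(f"{script:<12} {qp:5.2f} {qlow:5.2f}")
```

Under this model the results agree with the table to within
about 0.1; the small remaining differences are presumably a
matter of rounding.  Scripts listed as "all" HH come out at
exactly 6.0 and 2.0.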

===

HOW DOES THE UTF REPRESENTATION OF ISO 10646 WORK?

From: SCHEIN%TOROLAB5(_dot_)VNET(_dot_)IBM(_dot_)COM(_at_)SEARN(_dot_)SUNET(_dot_)SE
Sender: Multi-byte Code Issues <ISO10646%JHUVM(_dot_)BITNET(_at_)SEARN(_dot_)SUNET(_dot_)SE>
Reply-To: Multi-byte Code Issues <ISO10646%JHUVM(_dot_)BITNET(_at_)SEARN(_dot_)SUNET(_dot_)SE>
To: Multiple recipients of list ISO10646 <ISO10646(_at_)JHUVM>
Date: Mon, 9 Dec 1991 19:03:11 EST
Message-Id: <9112100004(_dot_)AA24213(_at_)othello(_dot_)admin(_dot_)kth(_dot_)se>
Subject: UTF description

I don't have a softcopy of the DIS-2 annex which describes UTF and
sample functions to convert from UCS to UTF and back. I attach
a simple table which might be useful.

-Isai

---------------------------------------------------------------------
|      UCS characters        |        UTF characters                |
|----------------------------|--------------------------------------|
|  FROM   |   TO    |  COUNT |1st byte|2nd byte|3rd byte| 4-5 bytes |
|---------|---------|--------|--------|--------|--------|-----------|
|0000 0000|0000 009F|    160 | 00-9F  |        |        |           |
|---------|---------|--------|--------|--------|--------|-----------|
|0000 00A0|0000 00FF|     96 |   A0   | A0-FF  |        |           |
|---------|---------|--------|--------|--------|--------|-----------|
|0000 0100|0000 4015| 85*190 | A1-F5  | 21-7E  |        |           |
|         |         | 16,150 |        | A0-FF  |        |           |
|---------|---------|--------|--------|--------|--------|-----------|
|0000 4016|0003 8E2D|6*190**2| F6-FB  | 21-7E  | 21-7E  |           |
|         |         |216,600 |        | A0-FF  | A0-FF  |           |
|---------|---------|--------|--------|--------|--------|-----------|
|0003 8E2E| maximum | up to  | FC-FF  | 21-7E  | 21-7E  |21-7E 21-7E|
|         |         |4*190**4|        | A0-FF  | A0-FF  |A0-FF A0-FF|
---------------------------------------------------------------------

[I have corrected a small error in the table. /OJ]
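For readers who want to experiment, the table can be turned into
a rough UCS-to-UTF conversion for the one- to three-byte forms.
This is a sketch derived from the table only, not the sample
function from the DIS annex.  It ASSUMES that the 190 trailing
byte values are assigned in the order listed (21-7E first, then
A0-FF), and it leaves out the five-byte forms.  The function
names are invented for this example.

```python
def trail_byte(index: int) -> int:
    """Map 0..189 to a trailing byte (ASSUMED order: 21-7E, then A0-FF)."""
    if not 0 <= index < 190:
        raise ValueError("trailing index out of range")
    return 0x21 + index if index < 94 else 0xA0 + (index - 94)


def ucs_to_utf(code: int) -> bytes:
    """Encode one UCS value following the ranges in the table above."""
    if code < 0:
        raise ValueError("negative code value")
    if code < 0xA0:                      # 0000 0000 .. 0000 009F: one byte
        return bytes([code])
    if code < 0x100:                     # 0000 00A0 .. 0000 00FF: A0 + itself
        return bytes([0xA0, code])
    if code < 0x4016:                    # 85 * 190 two-byte forms, lead A1-F5
        offset = code - 0x100
        return bytes([0xA1 + offset // 190, trail_byte(offset % 190)])
    if code < 0x38E2E:                   # 6 * 190**2 three-byte forms, lead F6-FB
        offset = code - 0x4016
        return bytes([
            0xF6 + offset // (190 * 190),
            trail_byte(offset // 190 % 190),
            trail_byte(offset % 190),
        ])
    raise NotImplementedError("five-byte forms are not covered by this sketch")


for code in (0x41, 0xE9, 0x0430, 0x4016):
    print(f"{code:04X} -> {ucs_to_utf(code).hex()}")
```

Whether a two-byte form is "HL" or "HH" depends only on whether
the trailing index falls in the first 94 values or the remaining
96, given the ordering assumed here.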