Re: Fixing RFC 1641

At 11:32 AM 11/30/94, Valdis(_dot_)Kletnieks(_at_)vt(_dot_)edu wrote:

I would like to fix this problem so that there will be a means of
transmitting Unicode directly, not encoded with UTF-7 or UTF-8, both of
which impose some overhead. Clearly this would not be for interoperability
with non-Unicode or non-MIME sites, but it would be convenient for
communication between sites using Unicode.


Is there anything Unicode-1-1 needs different from text/plain besides the
removal of the CRLF restriction?  And can you specify the exact nature of
the problem for those of us who aren't Unicode-literate?


The problem is that the CRLF restriction is defined on octets: it requires
that the octet sequence 0D 0A be used for line breaks and only for line
breaks. Since Unicode/10646 is a 16-bit character set, this octet sequence
does not mean line break and can in fact occur within or across other
characters, e.g.:

0D0A;MALAYALAM LETTER UU

or

090D;DEVANAGARI LETTER CANDRA E
0A20;GURMUKHI LETTER TTHA
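
To make the collision concrete, here is a minimal Python sketch. It assumes
the characters are serialized as big-endian 16-bit units (approximated here
with Python's "utf-16-be" codec); the helper name is made up for
illustration and is not from any spec.

```python
# Illustrative only: assumes big-endian 16-bit serialization of each character.
CRLF_OCTETS = b"\x0d\x0a"

def contains_crlf_octets(text):
    """Hypothetical helper: True if the raw octets contain 0D 0A anywhere."""
    return CRLF_OCTETS in text.encode("utf-16-be")

malayalam_uu = "\u0d0a"                 # MALAYALAM LETTER UU: octets 0D 0A
devanagari_gurmukhi = "\u090d\u0a20"    # two characters: octets 09 0D 0A 20

print(contains_crlf_octets(malayalam_uu))
print(contains_crlf_octets(devanagari_gurmukhi))
print(contains_crlf_octets("plain"))
```

A transport that scans octets for 0D 0A would see a "line break" in both of
the first two strings, even though neither contains one.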

The CRLF sequence would be represented in Unicode as 000D 000A, although
many other line break conventions are possible, including use of:

2029;PARAGRAPH SEPARATOR

or

2028;LINE SEPARATOR
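
As a rough sketch of what a receiver would have to do instead, the following
Python looks for line breaks at the character level rather than the octet
level. It assumes big-endian 16-bit units and handles only the three
conventions named above; the function name is hypothetical.

```python
import re

# Assumption: only CR LF (000D 000A), LINE SEPARATOR and PARAGRAPH SEPARATOR
# count as line breaks; other conventions are deliberately ignored here.
LINE_BREAK = re.compile("\r\n|\u2028|\u2029")

def split_lines_16bit(octets):
    """Hypothetical helper: decode big-endian 16-bit text, then split lines."""
    text = octets.decode("utf-16-be")
    return LINE_BREAK.split(text)

# The last "line" is MALAYALAM LETTER UU, whose octets are 0D 0A.
sample = "one\r\ntwo\u2028three\u2029\u0d0a".encode("utf-16-be")
print(split_lines_16bit(sample))
```

Because the split happens after decoding, the octets 0D 0A inside U+0D0A are
not mistaken for a line break.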


The problem is that the CRLF convention is dependent on the RFC 821 spec
of CRLF and 1000-char lines, so I don't see an easy way of removing it
until/unless you either (a) accept some sort of CTE or (b) accept that you
can only do it over a connection that uses an SMTP extension to negotiate a
binary transfer.


RFC 1641 recommended that unicode-1-1 only be used with binary-safe CTEs
such as binary or base64. This is fine for unicode-1-1, as it's not readable
by a recipient without both MIME and Unicode support anyway. However, the
new draft MIME spec doesn't allow that kind of out. It says all subtypes of
text must use CRLF conventions, period, regardless of CTE. My understanding
from discussions on this list is that this is a necessary fact of life for
compatibility with existing software. I would have been happy with a CTE
solution, but the consensus seems to be that it has to work the way the new
spec says.


Would it be acceptable to use unicode-1-1 with some sort of byte-stuffing
hack rather than the full UTF-7 or UTF-8, similar to the way RFC 821
specifies doubling a '.' at the start of a line?


Well, yes, but that would be yet another transformation format, something
I'd rather avoid.

This is a long term issue, because right now there are few enough binary
transports that Unicode would always get sent as base64 anyway, in which
case you might as well send it as UTF-7. I would like to start sounding out
a solution, however.
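
For a rough feel of the overhead involved, here is a small Python sketch
comparing base64 over raw 16-bit text with UTF-7. The sample string is
arbitrary and ASCII-only, and the 16-bit form is assumed to be big-endian;
MIME line folding of the base64 output is ignored.

```python
import base64

sample = "Hello world"  # arbitrary ASCII-only sample, chosen for illustration

raw16 = sample.encode("utf-16-be")    # two octets per character
as_base64 = base64.b64encode(raw16)   # base64 CTE applied to the raw octets
as_utf7 = sample.encode("utf-7")      # UTF-7 transformation format

print(len(raw16), len(as_base64), len(as_utf7))
```

This is the favorable case for UTF-7. For text that is mostly non-ASCII,
UTF-7 itself falls back to a modified base64 of the 16-bit units, so the two
sizes come out much closer.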

----------------------------
David Goldsmith
david_goldsmith(_at_)taligent(_dot_)com
Senior Scientist
Taligent, Inc.
10201 N. DeAnza Blvd.
Cupertino, CA  95014-2233