ietf-822

Re: Why do MIME implementations sometimes fail

2003-09-27 07:18:49

Jacob Palme wrote:

(Possibly one could criticize MIME for not supporting more
than one character set within the same body part. I know
that some MIME implementations implement multiple body
parts in such a way that not even a line break need be
visible between the body parts, but does the MIME standard
specify that it should work in this way?)

Yes, RFC 2046 gives the syntax for multipart MIME entities, and
that syntax is such that an entity may end at some content other
than CRLF (CRLF introduces a delimiter).  So technically it is
possible to change charset and/or language by using a multipart
message, probably multipart/related.  However, it would be
impractical, as the MIME "markup" (delimiters, MIME-part headers)
would be at least as onerous for a reader using a non-MIME UA
as the corresponding HTML markup in a non-HTML UA. And in practice
there are too many broken implementations -- e.g. many fail to
present the parts of multipart/related, particularly if multiple
levels of nesting are used (an example was discussed here: see
http://www.imc.org/ietf-822/mail-archive/msg02593.html).

The fact that this problem still happens, even if it is
a failure, might be a reason that the MIME standard should
maybe expressly specify that text copied between body parts,
or between header and body parts, should get its encoding
transformed to the encoding of the target body part.

One issue facing any potential implementor of such a scheme is
that one needs to know a substantial amount of context. Whether
or not a sequence of octets is an (RFC 2047) encoded-word depends
heavily on its context, to the point of requiring knowledge of
which header field it appears in and the syntax of that field, as
well as a number of rather complex rules (which could perhaps be
defined more clearly). For example,
   From: =?ISO-8859-1?Q?foo?= =?ISO-8859-1?Q?bar?=@baz.com
contains exactly one encoded-word.
   Received: from [1.2.3.4] by [5.6.7.8] for
      =?ISO-8859-1?Q?bar?=@baz.com ; 1 Jan 2003 12:34 +0100
      (=?iso-8859-1?q?Mitteleurop=E4ische_Zeit?=)
contains no encoded-words, whereas
   Date: 1 Jan 2003 12:34 +0100 (=?iso8859-1?q?Mitteleurop=E4ische_Zeit?=)
contains an encoded-word, and
   X-blurfl: =?ISO-8859-1?Q?foo?= (=?iso8859-1?q?Mitteleurop=E4ische_Zeit?=) ;
      =?ISO-8859-1?Q?bar?=
might contain from 0 to 3 encoded-words depending on the definition
of the X-blurfl header field -- and if you don't know what that
definition is, you can't determine how many encoded-words it
contains.  The implementor must know the syntax of every header
field, including private-use fields.  And he must know the context
of the text which is to be copied (i.e. if the user highlights some
text in a GUI, the implementor needs some way to retrieve the
header field in which the text appears, even if that header field
is folded across multiple lines). Consider whether or not
  Content-Location: http://users.erols.com/blilly/mailparse/(=?us-ascii?q?=3D?=)
contains an encoded-word.
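
To make the context dependence concrete, here is a minimal Python sketch.
The regular expression and the names CANDIDATE, count_candidates and
EXAMPLES are illustrative assumptions, not part of any standard or
library. The sketch counts strings that merely look like encoded-words
and sets that count beside the number stated for each example above.

```python
import re

# Loose pattern for the "=?charset?encoding?text?=" shape. This is an
# illustrative assumption; RFC 2047's own token rules are stricter.
CANDIDATE = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def count_candidates(field: str) -> int:
    """Count substrings that look like encoded-words, ignoring context."""
    return len(CANDIDATE.findall(field))


# (unfolded header field, number of encoded-words stated in the text above)
EXAMPLES = [
    ("From: =?ISO-8859-1?Q?foo?= =?ISO-8859-1?Q?bar?=@baz.com", 1),
    ("Received: from [1.2.3.4] by [5.6.7.8] for =?ISO-8859-1?Q?bar?=@baz.com ;"
     " 1 Jan 2003 12:34 +0100 (=?iso-8859-1?q?Mitteleurop=E4ische_Zeit?=)", 0),
    ("Date: 1 Jan 2003 12:34 +0100 (=?iso8859-1?q?Mitteleurop=E4ische_Zeit?=)", 1),
]

for field, stated in EXAMPLES:
    name = field.split(":", 1)[0]
    print(f"{name}: {count_candidates(field)} look-alike(s), "
          f"{stated} encoded-word(s) per the field's syntax")
```

A context-free scan cannot get from the first number to the second; that
step requires the grammar of each individual field.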

Another issue is that the encoded-word in a header field may use
a different charset from that specified for the body text. Indeed,
if multiple encoded-words are to be copied, each one might specify
a different charset (see the example and NOTE at the beginning of
RFC 2047 section 8).  A body part has a single charset. The
implementor is faced with the prospect of having to be able to
convert any of a quite large number of charsets
(http://www.iana.org/assignments/character-sets)
to any of the others.  That of course requires detailed knowledge
of all of those charsets and is further complicated by the fact that
several charsets use shift sequences (so what needs to be pasted may be
sensitive to the destination context). It should be noted that
some charset transformations are irreversible.  In some cases, no
transformation is possible (e.g. because the target charset has no
provision for the character in the source charset).
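
A short Python sketch of these conversion hazards, using only the
standard library. The header value and the transcode helper are made up
for illustration, and Python's codecs cover only part of the IANA
registry, so this understates the real problem.

```python
from email.header import decode_header


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode octets from one charset into another; raises on failure."""
    return data.decode(source).encode(target)


# Made-up header value: two adjacent encoded-words, two different charsets.
words = decode_header(
    "=?ISO-8859-1?Q?Mitteleurop=E4ische?= =?ISO-8859-2?Q?_Zeit?="
)
charsets = [charset for _, charset in words]

a_umlaut = b"\xe4"  # a-umlaut as a single ISO-8859-1 octet

# Representable in the target: the conversion succeeds.
as_utf8 = transcode(a_umlaut, "iso-8859-1", "utf-8")

# Not representable in the target: no faithful conversion exists.
try:
    transcode(a_umlaut, "iso-8859-1", "us-ascii")
    convertible = True
except UnicodeEncodeError:
    convertible = False

# Forcing the conversion substitutes a placeholder, which is irreversible.
lossy = a_umlaut.decode("iso-8859-1").encode("us-ascii", errors="replace")

# ISO-2022-JP brackets non-ASCII text with escape (shift) sequences, so
# the octets to paste depend on the shift state at the destination.
shifted = "\u65e5\u672c".encode("iso2022_jp")
print(charsets, as_utf8, convertible, lossy, shifted)
```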

Anybody who thinks either of those issues is trivial is welcome to
try his hand at an implementation.

(RFC 2047) Encoded-words (and IDNs) are primarily intended for
presentation; encoded-words are never part of any protocol exchange
(encoded-words only appear in comments, in unstructured fields such
as Subject, and in phrases such as a display name). Valid domain
name components consist solely of (US-ASCII) letters, digits, and
the hyphen character (LDH) -- the stuff obtained by transforming the
LDH octets that comprise an IDN is solely for presentation. I can
see no situation where it is *necessary* to copy an encoded-word
into a text/plain message body.
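
The distinction between the LDH form and the presentation form of a
domain name can be sketched with Python's built-in "idna" codec. The
domain name, the LDH_LABEL pattern and the is_ldh helper are assumptions
chosen for illustration.

```python
import re

# One domain-name label made only of letters, digits and hyphen (LDH).
LDH_LABEL = re.compile(r"[A-Za-z0-9-]+")


def is_ldh(domain: str) -> bool:
    """True if every dot-separated label consists solely of LDH characters."""
    return all(LDH_LABEL.fullmatch(label) for label in domain.split("."))


display = "b\u00fccher.example"                 # presentation form
wire = display.encode("idna").decode("ascii")   # LDH form used in protocols

print(wire, is_ldh(wire), is_ldh(display))

# The presentation form is recoverable from the LDH form when needed.
assert wire.encode("ascii").decode("idna") == display
```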

Given the implementation difficulties and lack of necessity, I
would not be in favor of any requirement such as proposed. Any
UA implementor who wishes to make provision for copying header field
content into a plain text message body or vice versa needs to
carefully consider the issues and reconsider whether such provision
is wise.

The main reason for this problem is, in my belief, that
implementors live in the U.S.A. and do not meet this
problem themselves.

Perhaps it has more to do with the nature of the "problem"?