The last structural shortcoming of MIME: how to remove it
=========================================================
Introduction
------------
Ned Freed <NED(_at_)INNOSOFT(_dot_)COM> wrote in message
<01HXV3KOM0PM9BWNBV(_at_)INNOSOFT(_dot_)COM> on the ietf-822 and
ietf-types lists, apropos of the proposed URL parameter for
Message/External-Body:
One major syntax issue does arise, however, in all of these proposals.
Embedding URLs in message header fields brings up some interesting
issues in regards to line folding. (HTTP may not be concerned with this,
but email applications definitely are.) URLs can be quite long, and
mailers have to be able to fold them. This is especially true if the URL
is just one parameter value, potentially one of many in a very long
field.
This is only a special case of a more general, structural
shortcoming of MIME, which I think should be solved by one
single extension of MIME, that applies not only to the URL
parameter but to any parameter, existing or to be defined in the
future.
Background
----------
One of the motivations for the MIME work was to remove three
restrictions on the content that could be transported by
SMTP mail:
1) lines were restricted to a maximum of 1000 octets,
2) octets 128-255 were not allowed,
3) the use of coded character sets other than US-ASCII could not
be indicated.
The line length problem mainly affected binary content; the
high-octet and character set problems also affected the coded
character sets needed for languages other than English.
All three problems were solved for message _bodies_ by MIME part 1
[RFC 1521], on the protocol level above SMTP.
For message _headers_ they were largely solved by MIME part 2
[RFC 1522]: A construct _encoded-word_ was introduced, with the
syntax
=?<charset>?<encoding>?<encoded-text>?=
The <charset> part solves problem 3. The two defined
<encoding>s, B and Q, both allow the representation of any
octets by means of restricted subsets of US-ASCII, solving
problem 2. Several encoded-words can be placed one after another,
separated by (insignificant) linear white space including line
breaks, so problem 1 was solved as well.
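
As a small illustration of how encoded-words behave, the sketch below
uses decode_header from Python's standard email.header module. The
helper name decode_words and the sample text are only illustrative
assumptions; they do not come from RFC 1522 or from this message.

```python
from email.header import decode_header

def decode_words(raw):
    """Decode a header value that may contain encoded-words into text."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Encoded-words come back as bytes plus their <charset>.
            parts.append(chunk.decode(charset or "ascii"))
        else:
            # A value without any encoded-word is returned unchanged.
            parts.append(chunk)
    return "".join(parts)

# One encoded-word using the Q encoding (octet E4 is a high octet).
single = "=?ISO-8859-1?Q?r=E4ksm=F6rg=E5s?="

# The same text split into two encoded-words on separate lines; the
# folding white space between them is insignificant.
folded = "=?ISO-8859-1?Q?r=E4ksm?=\r\n =?ISO-8859-1?Q?=F6rg=E5s?="

print(decode_words(single))
print(decode_words(folded))
```

Both calls should yield the same ten-character string, which is the
point: the charset label, the high octets and the line break all
survive transport as plain US-ASCII.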
Encoded-words can't be used everywhere in structured RFC 822
headers. They are excluded from address specifications, because
addresses are short and can only contain US-ASCII characters on
the Internet. They were also made unusable in quoted-strings, the
original RFC 822 construct for including arbitrary character
sequences in headers, because they are unnecessary there:
encoded-words can encode any sequence of octets that a
quoted-string can encode, and they are allowed everywhere a
quoted-string is allowed.
But the devil hides in the details. At the same time as this
general solution was introduced in MIME part 2, a new construct
for header fields was defined by MIME part 1: the header
_parameter_. It was initially used in the Content-Type header,
and later also in the Content-Disposition header. The syntax is:
<parameter-name>=<value>
If there are several parameters in a header, they are separated
by ";". <value> was restricted to be either a MIME token, which
can't contain the characters "=" or "?", or a quoted-string. But
in the first case encoded-words can't be expressed at all, and
in the second case a part of a quoted-string looking like an
encoded-word mustn't be interpreted as an encoded-word,
according to MIME part 2.
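
The token restriction can be checked mechanically. The following is a
minimal sketch, assuming the tspecials set of RFC 1521; the function
name is_token is hypothetical and chosen only for this illustration.

```python
# The "tspecials" of RFC 1521; none of them may appear in a token.
TSPECIALS = set('()<>@,;:\\"/[]?=')

def is_token(value):
    """Return True if value could be written as a bare MIME token."""
    return bool(value) and all(
        # Printable US-ASCII only: no space, no controls, no high octets.
        33 <= ord(ch) <= 126 and ch not in TSPECIALS
        for ch in value
    )

print(is_token("us-ascii"))                            # an ordinary token
print(is_token("=?ISO-8859-1?Q?r=E4ksm=F6rg=E5s?="))   # an encoded-word
```

The encoded-word fails the check because it contains "=" and "?", so
it could only be sent inside a quoted-string, where it must not be
interpreted as an encoded-word.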
So all three of the initial SMTP problems reappear in MIME
parameter values!
Approach to solving the problems
--------------------------------
Are these remaining problems insignificant in practice?
No. Ned Freed has already pointed out the need for very long
parameter values of the proposed URL parameter. Octets > 127 and
coded character sets other than US-ASCII are needed for
practically everything that is to be read by humans in languages
other than English, such as filenames. Parameters containing such
things are already defined (see the summary at the end of this
message).
Should these problems be handled on a case-by-case basis?
No, that doesn't seem to be a particularly smart approach:
- A case-by-case approach will probably lead to different
solutions for different parameters, increasing the burden for
implementers.
- The benefits of a tailor-made solution for a certain
parameter with a known purpose are probably small.
- In some cases the people specifying new parameters will
probably find the task of defining methods to handle high
octets and character set indication less important than
delivering, in a timely manner, a specification that anyway
is usable in most cases. This has already happened in the
MacMIME and Content-Disposition work.
- Problems in connection with the use of richer character
sets than US-ASCII are foreign and somewhat frightening to
many communication protocol specialists with no background in
so-called software internationalization.
- Most Anglo-American users will do very well without any
solution to the high octet and character set problems: They
only feel a need for using US-ASCII characters anyway. The
drawbacks of a case-by-case approach mentioned here will
therefore conspire to put Internet users _outside_ the USA and
the English-speaking part of Canada in a relatively less favoured
position, although of course nobody involved in mail protocol
development _desires_ the mail infrastructure of the Internet
to remain culturally biased.
- The three problems are fairly simple and, after six years
of MIME work, well understood. One single solution, to be
applied to all parameters that need one, can easily
be specified.
Suggested solution
------------------
How can this last remaining structural problem in MIME be solved
then, once and for all? I have the following proposal:
Let's use the %-encoding that is defined for URLs [RFC 1738].
This encoding is even easier to implement than the Q and B
encodings. The fact that it implies a 200 % or even 500 %
overhead for most non-ASCII characters is not so important,
since the main part of most messages is not in the parameter
values.
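
The overhead figures can be reproduced with a little arithmetic. The
sketch below uses quote from Python's urllib.parse; the helper name
overhead_percent is hypothetical. The text above does not say how the
500 % figure arises, so the second case is only one possible reading:
a character that needs two octets in its coded character set.

```python
from urllib.parse import quote

def overhead_percent(text, charset):
    """Extra characters added by %-encoding, relative to len(text)."""
    encoded = quote(text, safe="", encoding=charset)
    return 100 * (len(encoded) - len(text)) // len(text)

a_ring = "\u00e5"  # LATIN SMALL LETTER A WITH RING ABOVE

# One octet in ISO 8859-1: "%E5" is three characters instead of one.
print(overhead_percent(a_ring, "iso-8859-1"))

# Two octets in UTF-8: "%C3%A5" is six characters instead of one.
print(overhead_percent(a_ring, "utf-8"))
```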
We could extend and slightly reinterpret the MIME syntax for
parameters in this way:
a) The %-encoding is applied in values that are quoted-strings,
not those that are MIME tokens. (This solves the high octet
problem: A high octet is represented by "%" followed by the
two hexadecimal digits of its value.)
b) The %-encoding is, furthermore, only applied in a
quoted-string that starts with the three characters "=?%",
which in other respects have no significance. This will
reduce the backwards compatibility problem to a minimum.
Of presently defined parameters, only CHARSET, BOUNDARY,
NAME/FILENAME, and TYPE have a general enough syntax to be
affected. No registered charset values start in this way. A
Multipart boundary is not allowed to contain "%" (why, I don't
know). The likelihood of encountering filenames or type
descriptions that start with these three characters is
very small.
In this extended syntax, a weird file name such as "=?%A"
can be represented by a parameter
FILE="=?%=?%25A"
The remaining problem is not about the expressiveness of the
extended syntax. The problem is that a parameter
FILE="=?%A"
generated by an implementation conforming to the current MIME
standards will be mis-interpreted as specifying the filename
"A" by an implementation following the proposed extension.
c) In this kind of quoted-string, linear white space is
insignificant and can be freely inserted. This solves the
line length problem, thanks to the RFC 822 rules for header
folding.
d) The initial "=?%" string may be followed by "<", a MIME
charset value, and ">". Otherwise, "<" and ">" are not
allowed in this kind of quoted-string. They can be
represented by "%3C" and "%3E". If the values of a parameter
are text values, this specifies the coded character set of
the value. Otherwise, it can be ignored. (This rule solves
the character set problem for MIME parameters.)
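
Rules a) to d) can be sketched as a small decoder. This is only an
illustration of the proposal as worded above, not a specification.
The function name decode_extended is hypothetical, the input is
assumed to be the content of a quoted-string with the surrounding
quotes and any backslash escapes already removed, and the error
handling is a guess.

```python
import re
from urllib.parse import unquote_to_bytes

PREFIX = "=?%"  # rule b: marks a quoted-string as %-encoded

def decode_extended(value):
    """Return (charset, octets), or None if value is not extended."""
    if not value.startswith(PREFIX):
        return None  # ordinary quoted-string; use it unchanged
    # Rule c: linear white space (including folds) is insignificant.
    rest = re.sub(r"[ \t\r\n]+", "", value[len(PREFIX):])
    # Rule d: an optional "<charset>" label directly after the prefix.
    charset = None
    if rest.startswith("<"):
        end = rest.find(">")
        if end == -1:
            raise ValueError("unterminated charset label")
        charset, rest = rest[1:end], rest[end + 1:]
    if "<" in rest or ">" in rest:
        raise ValueError('"<" and ">" must be sent as %3C and %3E')
    # Rule a: "%" followed by two hex digits stands for one octet.
    return charset, unquote_to_bytes(rest)

# The two FILE examples from rule b.
print(decode_extended("=?%=?%25A"))   # the weird name, safely encoded
print(decode_extended("=?%A"))        # the ambiguous legacy value

# A filename with high octets and a charset label (illustrative).
charset, octets = decode_extended("=?%<iso-8859-1>r%E4ksm%F6rg%E5s.txt")
print(octets.decode(charset))
```

The second call shows the backwards compatibility problem described
under b): a legacy sender meant the four-character name, but this
decoder returns the single octet "A".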
Note that the value of the future URL parameter, according to
this scheme, can be created from any URL simply by adding the
four characters "=?% at the start and a " character at the end.
Long URLs should also be split across several lines.
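
Creating such a parameter, folding included, might look like the
sketch below. The parameter name URL, the helper names url_parameter
and unfold, and the chunk width are assumptions made for this
illustration only; the final syntax of the URL parameter is not
settled here.

```python
import re

def url_parameter(url, width=60, name="URL"):
    """Wrap url in the proposed syntax, folded every width characters."""
    chunks = [url[i:i + width] for i in range(0, len(url), width)]
    # A fold is CRLF followed by a space, as in RFC 822 header folding.
    return name + '="=?%' + "\r\n ".join(chunks) + '"'

def unfold(field):
    """Drop the insignificant white space again (rule c)."""
    return re.sub(r"[ \t\r\n]+", "", field)

url = "http://www.example.com/" + "x" * 100
field = url_parameter(url, width=40)
print(field)
print(unfold(field) == 'URL="=?%' + url + '"')
```

Since white space is removed before the %-decoding step, it does not
matter if a fold happens to land in the middle of a "%XX" triplet.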
What to do
----------
A specification for "Extended MIME Parameter Syntax", maybe
along these lines, should be written as soon as possible and be
put on the standards track. Other drafts, like that for a URL
external-body access-type, can be made to agree with it in
parallel.
Summary of currently registered parameters
------------------------------------------
Annotations used below
- - - - - - - - - - --
* It would be useful if any octet value could be indicated
in the parameter value.
** It would be useful if text of any charset could be included
in the parameter value.
% This is an access-type of Message/External-Body, not a subtype.
? The registration form is ambiguous as regards parameters.
! The registration form is incomplete as regards parameters.
# The registration form is incorrect as regards parameters.
Content-Disposition: parameter
- - - - - - - - - - - - - - --
[RFC 1806]
Disposition value   Parameters
-----------------   ----------
inline              filename*
attachment          filename*
Content-Type: top-level media types, subtypes, and parameters
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[ftp://ftp.isi.edu/in-notes/iana/assignments/media-types/media-types]
  Top-level type  Subtype               Parameters
  --------------  -------               ----------
  text            plain                 charset
                  richtext              charset
                  enriched              charset
?                 tab-separated-values  charset
  multipart       mixed                 boundary
                  alternative           boundary
                  digest                boundary
                  parallel              boundary
                  appledouble           boundary
                  header-set            boundary
!                 form-data             boundary
  message         rfc822
                  partial               id number total
                  external-body         access-type expiration size permission
                                        name* site directory mode server
                  news
  application     octet-stream          name* type** conversions padding
                  postscript
                  oda                   profile
                  atomicmail
                  andrew-inset
                  slate                 version
#                 wita
#                 dec-dx
#                 dca-rft
                  activemessage
                  rtf
                  applefile             name* type**
                  mac-binhex40          name*
%                 news-message-id       name site
                  news-transmission     conversions
                  wordperfect5.1
                  pdf
                  zip
                  macwriteii
                  msword                version
                  remote-printing
                  mathematica           version filename*
                  cybercash
                  commonground
                  iges
?                 riscos                name* type load exec access
                  eshop
                  x400-bp               bp-type
  image           jpeg
                  gif
!                 ief
                  g3fax                 page-length page-width encoding
                                        resolution dcs pages
                  tiff
  audio           basic
  video           mpeg
                  quicktime
/Olle
--
Olle Jarnefors, Royal Institute of Technology, Stockholm
<ojarnef(_at_)admin(_dot_)kth(_dot_)se>