The last structural shortcoming of MIME: how to remove it
=========================================================

Introduction
------------

Ned Freed <NED(_at_)INNOSOFT(_dot_)COM> wrote in message
<01HXV3KOM0PM9BWNBV(_at_)INNOSOFT(_dot_)COM> on the ietf-822 and
ietf-types lists, apropos of the proposed URL parameter for
Message/External-Body:

One major syntax issue does arise, however, in all of these proposals.
Embedding URLs in message header fields brings up some interesting issues
in regards to line folding. (HTTP may not be concerned with this, but email
applications definitely are.)  URLs can be quite long, and mailers have to
be able to fold them. This is especially true if the URL is just one
parameter value, potentially one of many in a very long field.


This is only a special case of a more general, structural
shortcoming of MIME, which I think should be solved by one
single extension of MIME, that applies not only to the URL
parameter but to any parameter, existing or to be defined in the
future.


Background
----------

One of the motivations for the MIME work was to remove three
restrictions on the content that could be transported by
SMTP mail:

1) lines were restricted to a maximum of 1000 octets,

2) octets 128-255 were not allowed,

3) use of coded character sets other than US-ASCII could not be
   indicated.

The line length problem affected binary content in particular;
the high octet and character set problems also affected the
coded character sets needed for languages other than English.

All three problems were solved for message _bodies_ by MIME part 1
[RFC 1521], on the protocol level above SMTP.

For message _headers_ they were largely solved by MIME part 2
[RFC 1522]: A construct _encoded-word_ was introduced, with the
syntax

   =?<charset>?<encoding>?<encoded-text>?=

The <charset> part solves problem 3. The two defined
<encoding>s, B and Q, both allow the representation of any
octets by means of restricted subsets of US-ASCII, solving
problem 2. Several encoded-words can be placed after each other,
separated by (insignificant) linear white space including line
breaks, so problem 1 was solved as well.
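
As a minimal sketch of how the three parts of an encoded-word work
together, the following Python fragment decodes a name that is split
over two folded encoded-words. It relies on decode_header from the
Python standard library; the helper name decode_encoded_words and the
sample header value are assumptions made only for this illustration.

```python
from email.header import decode_header


def decode_encoded_words(header_value: str) -> str:
    """Decode RFC 1522 encoded-words in a header value to a Unicode string."""
    pieces = []
    for chunk, charset in decode_header(header_value):
        if isinstance(chunk, bytes):
            # Text outside encoded-words comes back without a charset;
            # treat it as US-ASCII.
            pieces.append(chunk.decode(charset or "us-ascii"))
        else:
            pieces.append(chunk)
    return "".join(pieces)


# One name split over two encoded-words on two folded lines (problem 1),
# carrying octet 0xE4 (problem 2), labelled as ISO-8859-1 (problem 3).
folded = "=?iso-8859-1?Q?J=E4rne?=\r\n =?iso-8859-1?Q?fors?="
print(decode_encoded_words(folded))  # prints: Järnefors
```

The white space between the two encoded-words, including the line
break, disappears in the decoded result.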

Encoded-words can't be used everywhere in structured RFC 822
headers. They are excluded from address specifications, because
addresses are short and can only contain US-ASCII characters on
the Internet. They were also made unusable in quoted-string, the
original RFC 822 construct for including arbitrary character
sequences in headers, because they are unnecessary there:
Encoded-words can encode any sequence of octets that is
encodable by a quoted-string, and they are allowed everywhere a
quoted-string is allowed.

But the devil hides in the details. At the same time this
general solution was introduced in MIME part 2, a new construct
for header fields was defined by MIME part 1, the header
_parameter_. It was initially used in the Content-Type header,
later also in the Content-Disposition header. The syntax is:

   <parameter-name>=<value>

If there are several parameters in a header, they are separated
by ";".  <value> was restricted to be either a MIME token, which
can't contain the characters "=" or "?", or a quoted-string. But
in the first case encoded-words can't be expressed at all, and
in the second case a part of a quoted-string looking like an
encoded-word mustn't be interpreted as an encoded-word,
according to MIME part 2.
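
The two forms of <value> can be sketched in Python as follows. The set
of tspecials is taken from RFC 1521; the function names is_token and
format_parameter are hypothetical and serve only this illustration.

```python
# tspecials from RFC 1521: a value containing any of these characters
# must be written as a quoted-string.
TSPECIALS = set('()<>@,;:\\"/[]?=')


def is_token(value: str) -> bool:
    """True if value is a MIME token: printable US-ASCII without
    SPACE, control characters or tspecials."""
    return bool(value) and all(
        32 < ord(ch) < 127 and ch not in TSPECIALS for ch in value
    )


def format_parameter(name: str, value: str) -> str:
    """Render name=value, quoting the value only when it is not a token."""
    if is_token(value):
        return f"{name}={value}"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{name}="{escaped}"'


print(format_parameter("charset", "iso-8859-1"))
# An encoded-word contains "=" and "?", so it cannot be a token; inside
# the quoted-string it is, by MIME part 2, just literal text.
print(format_parameter("name", "=?iso-8859-1?Q?J=E4rnefors?="))
```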

So all three of the initial SMTP problems reappear in MIME
parameter values!


Approach to solving the problems
--------------------------------

Are these remaining problems insignificant in practice?

No. Ned Freed has already pointed out the need for very long
parameter values of the proposed URL parameter. Octets > 127 and
coded character sets other than US-ASCII are needed for
practically everything that is to be read by humans in languages
other than English, such as filenames. Parameters containing such
things are already defined (see the summary at the end of this
message).

Should these problems be handled on a case-by-case basis?

No, that doesn't seem to be a particularly smart approach:

-  A case-by-case approach will probably lead to different
   solutions for different parameters, increasing the burden for
   implementers.

-  The benefits of a tailor-made solution for a certain
   parameter with a known purpose are probably small.

-  In some cases the people specifying new parameters will
   probably find the task of defining methods to handle high
   octets and character set indication less important than
   delivering, in a timely manner, a specification that is
   usable in most cases anyway. This has already happened in the
   MacMIME and Content-Disposition work.

-  Problems in connection with the use of richer character
   sets than US-ASCII are foreign and somewhat frightening to
   many communication protocol specialists with no background in
   so-called software internationalization.

-  Most Anglo-American users will do very well without any
   solution to the high octet and character set problems: They
   only feel a need for using US-ASCII characters anyway. The
   drawbacks mentioned here with a case-by-case approach will
   therefore conspire to put Internet users _outside_ the USA and the
   English-speaking part of Canada in a relatively less favoured
   position, although of course nobody involved in mail protocol
   development _desires_ the mail infrastructure of the Internet
   to remain culturally biased.

-  The three problems are fairly simple and, after six years
   of MIME work, well understood. One single solution to be
   applied for all parameters that need a solution can easily
   be specified.


Suggested solution
------------------

How can this last remaining structural problem in MIME be solved
then, once and for all? I have the following proposal:

Let's use the %-encoding that is defined for URLs [RFC 1738].
This encoding is even easier to implement than the Q and B
encodings. The fact that it implies a 200 % or even 500 %
overhead for most non-ASCII characters is not so important,
since the main part of most messages is not in the parameter
values.
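
The overhead figures can be checked with a few lines of Python. This is
only one possible reading of the numbers: it assumes that 200 % refers
to a character coded as a single octet, and 500 % to a character coded
as two octets, both measured against one unencoded character. The
helper overhead_percent is hypothetical.

```python
from urllib.parse import quote


def overhead_percent(encoded: str, original_units: int) -> float:
    """Extra length of the encoded form, as a percentage of the original."""
    return 100.0 * (len(encoded) - original_units) / original_units


# One character coded as one octet (ISO-8859-1): "%E4", three characters.
one_octet = quote("\u00e4".encode("iso-8859-1"))
# The same character coded as two octets (UTF-8 is used here as an
# example of a multi-octet coding): "%C3%A4", six characters.
two_octets = quote("\u00e4".encode("utf-8"))

print(one_octet, overhead_percent(one_octet, 1))    # %E4 200.0
print(two_octets, overhead_percent(two_octets, 1))  # %C3%A4 500.0
```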

We could extend and slightly reinterpret the MIME syntax for
parameters in this way:

a) The %-encoding is applied in values that are quoted-strings,
   not those that are MIME tokens. (This solves the high octet
   problem: A high octet is represented by "%" followed by the
   two hexadecimal digits of its value.)

b) The %-encoding is, furthermore, only applied in a
   quoted-string that starts with the three characters "=?%",
   which in other respects have no significance. This will
   reduce the backwards compatibility problem to a minimum.

   Of presently defined parameters, only CHARSET, BOUNDARY,
   NAME/FILENAME, and TYPE have a general enough syntax to be
   affected. No registered charset values start in this way. A
   Multipart boundary is not allowed to contain "%" (why I don't
   know). The likelihood of encountering filenames or type
   descriptions having these three as their first characters is
   very limited.

   In this extended syntax, a weird file name such as "=?%A"
   can be represented by a parameter
      FILE="=?%=?%25A"
   The remaining problem is not about the expressiveness of the
   extended syntax. The problem is that a parameter
      FILE="=?%A"
   generated by an implementation conforming to the current MIME
   standards will be mis-interpreted as specifying the filename
   "A" by an implementation following the proposed extension.

c) In this kind of quoted-string, linear white space is
   insignificant and can be freely inserted. This solves the
   line length problem, thanks to the RFC 822 rules for header
   folding.

d) The initial "=?%" string may be followed by "<", a MIME
   charset value, and ">". Otherwise, "<" and ">" are not
   allowed in this kind of quoted-string. They can be
   represented by "%3C" and "%3E". If the values of a parameter
   are text values, this specifies the coded character set of
   the value. Otherwise, it can be ignored. (This rule solves
   the character set problem for MIME parameters.)
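
Rules a) to d) can be sketched as a small Python decoder. This is an
illustration of the proposal under stated assumptions, not a reference
implementation: the function name decode_extended is hypothetical, the
input is assumed to be the content of a quoted-string with the
surrounding quotes already removed, and error handling (for example a
"<" without a matching ">") is left out.

```python
from typing import Optional, Tuple
from urllib.parse import unquote_to_bytes

MARKER = "=?%"


def decode_extended(content: str) -> Tuple[bytes, Optional[str]]:
    """Decode the content of a quoted-string parameter value.

    Returns (octets, charset); charset is None when none was given.
    """
    if not content.startswith(MARKER):
        # Rule b: an ordinary quoted-string is taken literally.
        return content.encode("us-ascii"), None
    # Rule c: linear white space, including folds, is insignificant.
    body = "".join(content[len(MARKER):].split())
    charset = None
    if body.startswith("<"):
        # Rule d: an optional <charset> directly after the marker.
        end = body.index(">")
        charset, body = body[1:end], body[end + 1:]
    # Rule a: "%" followed by two hexadecimal digits is one octet.
    return unquote_to_bytes(body), charset


# The weird file name from rule b), written in the extended syntax.
print(decode_extended("=?%=?%25A"))   # (b'=?%A', None)
# The compatibility problem: an old-style value is read as "A".
print(decode_extended("=?%A"))        # (b'A', None)
# A folded value with a charset indication (sample data).
octets, charset = decode_extended("=?%<iso-8859-1>J%E4rne\r\n fors.txt")
print(octets.decode(charset))         # Järnefors.txt
```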

Note that the value of the future URL parameter, according to
this scheme, can be created from any URL simply by adding the
four characters "=?% at the start and a " character at the end.
Long URLs should also be split over several lines.
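
The wrapping and folding of a URL can be sketched like this in Python.
The function names and the chunk size are assumptions for illustration
only, and the URL is assumed to be already %-encoded, with no white
space, quotes, backslashes, "<" or ">" in it. The inverse function only
strips the wrapper and the inserted white space; it does not interpret
the "%" escapes of the URL.

```python
def url_to_parameter_value(url: str, chunk: int = 60) -> str:
    """Wrap a URL as a quoted-string starting with "=?%", folded so that
    each piece of the URL is at most `chunk` characters long."""
    pieces = [url[i:i + chunk] for i in range(0, len(url), chunk)]
    # CRLF followed by a space is an RFC 822 fold; under rule c) the
    # white space it leaves behind is insignificant.
    return '"=?%' + "\r\n ".join(pieces) + '"'


def parameter_value_to_url(value: str) -> str:
    """Drop the quotes and the marker, then remove the white space."""
    content = value[1:-1]
    if not content.startswith("=?%"):
        raise ValueError("not an extended parameter value")
    return "".join(content[3:].split())


url = "ftp://ftp.isi.edu/in-notes/iana/assignments/media-types/media-types"
value = url_to_parameter_value(url, chunk=40)
print(value)
print(parameter_value_to_url(value) == url)  # True
```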


What to do
----------

A specification for "Extended MIME Parameter Syntax", maybe
along these lines, should be written as soon as possible and be
put on the standards track. Other drafts, like that for a URL
external-body access-type, can be made to agree with it in
parallel.


Summary of currently registered parameters
------------------------------------------

Annotations used below
- - - - - - - - - - --
*  It would be useful if any octet value could be indicated
   in the parameter value.
** It would be useful if text of any charset could be included
   in the parameter value.
%  This is an access-type of Message/External-Body, not a subtype.
?  The registration form is ambiguous as regards parameters.
!  The registration form is incomplete as regards parameters.
#  The registration form is incorrect as regards parameters.

Content-Disposition: parameter
- - - - - - - - - - - - - - --
[RFC 1806]

Disposition value   Parameters
-----------------   ----------
inline              filename*
attachment          filename*


Content-Type: top-level media types, subtypes, and parameters
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
[ftp://ftp.isi.edu/in-notes/iana/assignments/media-types/media-types]

Top-level type  Subtype               Parameters
--------------  -------               ----------
text            plain                 charset
                richtext              charset
                enriched              charset
              ? tab-separated-values  charset

multipart       mixed                 boundary
                alternative           boundary
                digest                boundary
                parallel              boundary
                appledouble           boundary
                header-set            boundary
              ! form-data             boundary

message         rfc822                
                partial               id  number  total
                external-body         access-type  expiration  size  permission
                                      name*  site  directory  mode  server
                news                  

application     octet-stream          name*  type**  conversions  padding
                postscript            
                oda                   profile
                atomicmail            
                andrew-inset          
                slate                 version
              # wita                  
              # dec-dx                
              # dca-rft               
                activemessage         
                rtf                   
                applefile             name*  type**
                mac-binhex40          name*
              % news-message-id       name  site
                news-transmission     conversions
                wordperfect5.1        
                pdf                   
                zip                   
                macwriteii            
                msword                version
                remote-printing       
                mathematica           version  filename*
                cybercash             
                commonground          
                iges                  
              ? riscos                name*  type  load  exec  access
                eshop                 
                x400-bp               bp-type

image           jpeg                  
                gif                   
              ! ief                   
                g3fax                 page-length  page-width  encoding
                                      resolution  dcs  pages
                tiff                  

audio           basic                 

video           mpeg                  
                quicktime             


/Olle

--
Olle Jarnefors, Royal Institute of Technology, Stockholm 
<ojarnef(_at_)admin(_dot_)kth(_dot_)se>