Re: quoted-phrase in content-disposition header

Keith Moore <moore(_at_)cs(_dot_)utk(_dot_)edu> writes:

I'm a bit concerned about the use of 'quoted-phrase' in the
Content-Disposition draft.  If nothing else, encoded-words aren't
allowed within double quotes in other headers, and I would rather not
make an exception for Content-Disposition.

I have two alternative suggestions:


I prefer a solution based on Alternative 2.

Alternative 2:

For parameters, I'd rather see an encoding scheme that allowed not
just character data, but also binary parameters encoded with base64.
Something like:

value = ( token / quoted-string / base64-chunk-list )

base64-chunk-list = [ charset ] "[" *base64-chunk "]"

base64-chunk = 1*17 ( 4*4 ( b64char ) )

b64char = Any of the ASCII characters: 
          "A"-"Z", "a"-"z", "0"-"9", "+", "=", "/"

token and quoted-string implicitly have ASCII values
for base64-chunk-list, if charset is omitted, values are either ASCII
        or binary, as appropriate for that parameter
otherwise charset is the name of a MIME charset.

any amount of linear-white-space (including "CRLF SPACE") and/or
comments may appear between base64-chunks, but they are ignored when
decoding.

This might seem less general than encoded-words, because it doesn't
let you mix character sets in a single parameter.  However, you can
still use MIME charsets that allow charset switching (e.g.
ISO-2022-JP).


I agree that the possibility to switch between MIME charsets
within a parameter value is unimportant. The only character set
switching system used in data processing is ISO 2022. It is
almost exclusively used in Eastern Asia, and each application
is confined to a very limited subset of the baroque complexity
of full ISO 2022. These different profiles should be registered
as separate MIME charset values like ISO-2022-JP.

I would suggest the following modifications of Alternative 2:

1) In addition to the Base64-like encoding, there should be a
   Quoted-Printable-like encoding. Reasons for this are that
   such encodings are already provided for message bodies and
   unstructured headers, and that Quoted-Printable is more
   practical than Base64 for names in many Euro-American
   languages, where non-ASCII letters are important though not
   dominant.

   One way of doing this would be to change the syntax of e.g.

      filename = iso-8859-1 [R/Z0dGluZ2VuLm1hcA==]

   to

      filename = :b:iso-8859-1: R/Z0dGluZ2VuLm1hcA==

   Alternatively, this could be written:

      filename = :q:iso-8859-1: G=F6ttingen.map

   (yes, a map of the German university town of G<o-umlaut>ttingen).

   Nice side-effects of this syntax are that it is clear
   already from the first character whether the "value" is of
   the syntactical category "token", "quoted-string", or
   "data-chunk-list", and that the superfluous "]" at the end is
   avoided. (Parameters end at the next unquoted ";", or at the
   end of the header.)

2) When the parameter is binary, the syntax would be e.g.

      x-binary-param = :b:: c4JJZ1h1gmLXmAIBDQ==

   To make this scheme even more general, not only arbitrary
   byte sequences but also arbitrary bit sequences should be
   encodable. A technique similar to the Padding parameter of
   Application/Octet-stream should be used.

      x-binary-param = :b:-3: c4JJZ1h1gmLXmAIBDQ==

   would then be the same data as above, but with the last three
   bits of the last non-padding byte also regarded as padding.

3) To make the new encoding scheme less byzantine than RFC 1522
   encoding and less likely to be distorted by bad RFC 822
   implementations, some now unnecessary uses of quoted strings
   (according to the RFC 822 definition) should be abolished,
   namely all use of control characters and SPACE in a
   quoted-string.

4) Similarly, in the :q: encoding, SPACE, all control
   characters, and the characters ( ) ; " = should have to be
   encoded by =hh sequences.

5) I don't know why base64-chunks are restricted to 68
   characters. The specification of the new encoding scheme
   doesn't have to include length restrictions.

6) An empty parameter value may be specified by a quoted-string
   "", so we may as well allow syntactically empty parameter
   values:

      Content-Disposition: attachment; x-param=; filename=file.ext

7) The specification of the new encoding scheme should be very
   explicit about the points at which linear-white-space can be
   inserted without changing the semantics.
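
Suggestions 1, 2 and 4 can be sketched in a few lines of Python.
This is only an illustration under my reading of the examples
above, not a normative algorithm. The function names encode_b,
encode_q and decode_bits are made up for this sketch, and the
single SPACE written after the last ":" merely mirrors the
examples.

```python
import base64

# Bytes that suggestion 4 requires to be written as =hh in the :q: form.
Q_SPECIALS = set(b'();"=')


def encode_b(text, charset):
    """Return text in the proposed :b: (Base64-like) form."""
    raw = text.encode(charset)
    return ":b:%s: %s" % (charset, base64.b64encode(raw).decode("ascii"))


def encode_q(text, charset):
    """Return text in the proposed :q: (Quoted-Printable-like) form."""
    out = []
    for byte in text.encode(charset):
        # Printable ASCII other than ( ) ; " = stays literal; SPACE,
        # control characters and 8-bit bytes become =hh.
        if 0x21 <= byte <= 0x7E and byte not in Q_SPECIALS:
            out.append(chr(byte))
        else:
            out.append("=%02X" % byte)
    return ":q:%s: %s" % (charset, "".join(out))


def decode_bits(b64_text, padding=0):
    """Decode a binary :b: body to a '0'/'1' string, dropping padding bits."""
    data = base64.b64decode(b64_text)
    bits = "".join(format(byte, "08b") for byte in data)
    return bits[:len(bits) - padding]


name = "G\u00f6ttingen.map"
print(encode_b(name, "iso-8859-1"))  # :b:iso-8859-1: R/Z0dGluZ2VuLm1hcA==
print(encode_q(name, "iso-8859-1"))  # :q:iso-8859-1: G=F6ttingen.map
print(len(decode_bits("c4JJZ1h1gmLXmAIBDQ==", 3)))  # 101 (13 bytes minus 3 bits)
```

The two filename lines reproduce the examples in suggestion 1,
and the last line shows that ":b:-3:" leaves 101 significant bits
of the 13 encoded bytes.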

To illustrate the relative simplicity of this modified scheme I
include here an almost full EEBNF description. I have extended
the EBNF of RFC 822 by these constructs:

+  Set difference operator "--": "CHAR7--DIGIT" means any
   "CHAR7" except those that also are "DIGIT"s.

+  Simplified character enumeration construct "AnyOf:": It
   indicates any of the following characters up to the next line
   break, and must thus be the last meta-syntactical unit on the
   line; not even a comment may follow. As an example
      AnyOf: ():",
   is equivalent to
      "("/")"/":"/<">/","

+  In string literals, genuine <"> characters can be indicated
   by doubling, like in Pascal strings.

This is the almost full EEBNF description. It only ignores one
aspect -- where linear-white-space can be inserted.

0    value           = token / q-string / data-chunk-list / ""

1    token           = 1*(GRAPH7 -- tspecials)

11   tspecials       = AnyOf: ()<>@,;:\"/[]?=

2    q-string        = """" *(q-char/q-pair) """"

21   q-char          = GRAPH7 -- AnyOf: "\

22   q-pair          = "\" GRAPH7

3    data-chunk-list = b-chunk-list / q-chunk-list

31   b-chunk-list    = ":" ("b"/"B") ":" [charset-token/padding] ":" *b-chunk

311  padding         = "-" AnyOf: 01234567

312  b-chunk         = *(4*4 b-char)

3121 b-char          =
       AnyOf: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/=

32   q-chunk-list    = ":" ("q"/"Q") ":" charset-token ":"
                       [ q-chunk *("=" CRLF q-chunk) ]

321  q-chunk         = *(q-char/octet)

3211 q-char          = GRAPH7 -- AnyOf: ();"=

3212 octet           = "=" 2 AnyOf: 0123456789ABCDEFabcdef

4    GRAPH7          = CHAR -- CTL -- SPACE

41   CHAR            = DefinedIn: RFC 822

42   CTL             = DefinedIn: RFC 822

43   SPACE           = DefinedIn: RFC 822

5    charset-token   = iana-token / x-token

51   iana-token      = DefinedIn: RFC 1521

52   x-token         = DefinedIn: RFC 1521

6    CRLF            = DefinedIn: RFC 822
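
As a rough check that rule 0 can be decoded with little effort,
here is a Python sketch of a decoder for "value". It rests on
assumptions that the grammar above leaves open: all white space
inside a data-chunk-list is treated as insignificant, "=" CRLF in
a q-chunk-list is treated as a soft line break, and token and
q-string are taken to be US-ASCII. The names Value and
decode_value are hypothetical.

```python
import base64
import re
from typing import NamedTuple, Optional

TSPECIALS = '()<>@,;:\\"/[]?='
CHUNK_LIST = re.compile(r":([bBqQ]):([^:]*):(.*)\Z", re.S)


class Value(NamedTuple):
    kind: str               # "empty", "token", "q-string", "b" or "q"
    charset: Optional[str]  # None means binary (or unspecified) data
    data: bytes
    padding: int = 0        # trailing bits to ignore (rule 311)


def _decode_q(body):
    body = body.replace("=\r\n", "")      # assumed soft line break
    body = "".join(body.split())          # assumed: white space ignored
    return re.sub(rb"=([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]),
                  body.encode("ascii"))


def decode_value(raw):
    text = raw.strip()
    if text == "":
        return Value("empty", None, b"")
    if text[0] == '"':
        if len(text) < 2 or text[-1] != '"':
            raise ValueError("unterminated q-string")
        inner = re.sub(r"\\(.)", r"\1", text[1:-1])   # undo q-pairs
        return Value("q-string", "us-ascii", inner.encode("ascii"))
    if text[0] == ":":
        m = CHUNK_LIST.match(text)
        if m is None:
            raise ValueError("malformed data-chunk-list")
        kind, field, body = m.group(1).lower(), m.group(2).strip(), m.group(3)
        if kind == "q":
            if not field:
                raise ValueError(":q: requires a charset")
            return Value("q", field, _decode_q(body))
        data = base64.b64decode("".join(body.split()))
        if re.fullmatch(r"-[0-7]", field):
            return Value("b", None, data, int(field[1]))
        return Value("b", field or None, data)
    if any(c in TSPECIALS or not 0x21 <= ord(c) <= 0x7E for c in text):
        raise ValueError("not a token")
    return Value("token", "us-ascii", text.encode("ascii"))


v = decode_value(":q:iso-8859-1: G=F6ttingen.map")
print(v.kind, v.charset, v.data.decode(v.charset))
```

Because the first character already decides the syntactical
category, the decoder needs no look-ahead beyond that character.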

If the new encoding scheme is introduced not in the
Content-Disposition: specification but as a separate general
extension of MIME -- an alternative I favour -- it will be
necessary to specify the context in which the "value", as
defined in rule 0, occurs. We need a general syntax for the
Content-*: header fields where parameters are possible. I think
something like the following should be adequate:

 parameterized-header = field-name ":" list *(";" parameter) CRLF

     list             = [ element-sequence *("," element-sequence) ]

     element-sequence = *element

     element          = mailbox / word / aggregate

     aggregate        = token *("/" token)

     parameter        = attribute "=" value

     attribute        = token
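
To show how such a header would be taken apart, here is a Python
sketch that splits a parameterized-header at each unquoted ";".
It is a simplification under stated assumptions: comments in
parentheses are not handled, attribute names are folded to lower
case, and the "list" part is returned as an unparsed string. The
name split_header is hypothetical.

```python
def split_header(line):
    """Split 'field-name: list; attr=value; ...' into its three parts."""
    name, _, rest = line.partition(":")
    parts, current = [], []
    in_quotes = escaped = False
    for ch in rest:
        if escaped:                        # character after "\" in a q-pair
            current.append(ch)
            escaped = False
        elif in_quotes and ch == "\\":
            current.append(ch)
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == ";" and not in_quotes:  # an unquoted ";" ends a parameter
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))

    params = {}
    for part in parts[1:]:
        attribute, _, value = part.partition("=")
        params[attribute.strip().lower()] = value.strip()
    return name.strip(), parts[0].strip(), params


print(split_header(
    "Content-Disposition: attachment; x-param=; filename=file.ext"))
```

Since ";" must be written as =3B in the :q: encoding and is not a
b-char, a data-chunk-list never contains an unquoted ";", so the
same splitting rule works for all four kinds of "value".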

The syntax of different Content-*: headers will then be
specialized forms of this general syntax:

Content-Type: has "aggregate"s as the only kind of "element"s,
and all "element-sequence"s as well as "list"s have only one
component.

Content-Transfer-Encoding: in addition has only one "token" in
each "aggregate", and has no "parameter"s.

Content-Disposition: is similar to Content-Transfer-Encoding:
but has "parameter"s.

Content-Language: has no "parameter"s, "word" as the only kind of
"element"s, and the "element-sequence"s have only one component.

Content-ID: has no "parameter"s, "list"s and "element-sequence"s
have only one component, and "mailbox" is the only kind of
"element". (The syntax of this is further restricted to
"msg-id" according to RFC 822.)

--
Olle Jarnefors, Royal Institute of Technology, Stockholm 
<ojarnef(_at_)admin(_dot_)kth(_dot_)se>