Interpretation of RFC 2047


[I originally sent this last Thursday, but omitted to include a Subject.
That may have caused several people not to see it, so here it is again.

It is sent primarily to the ietf-822 list (which is where Reply-To
is set) because I am anxious to hear the views of the email experts.
However, it is CCed for information to the Usefor list.]

I am having problems understanding RFC 2047. In particular, I want to
establish what a user agent would be REQUIRED to do in order to claim
minimal compliance with RFC 2047 (yes, I know actual implementations are
likely to be more 'liberal' than that, but I am after a fundamental
baseline which other documents can *rely* on).

The problem splits into two parts:

1. Where can encoded-words legitimately appear (RFC 2047 section 5)?
2. Where are encoded-words required to be recognized (RFC 2047 section 6)?


1. Where can encoded-words legitimately appear (RFC 2047 section 5)?
-------------------------------------------------------------------

In 5 (1) I find:

   An 'encoded-word' may replace a 'text' token (as defined by RFC 822)
   in any Subject or Comments header field, any extension message header
   field, or any MIME body part field for which the field body is
   defined as '*text'. An 'encoded-word' may also appear in any
   user-defined ("X-") message or body part header field.

That is ambiguous, depending on how you interpret the commas in the first
sentence:

Interpretation A:

It means you can use an encoded-word in
        any Subject
        any Comments
        any extension message header field
        any MIME body part field for which the field body is defined as '*text'
        any X-header

Interpretation B:

It means you can use an encoded-word in
        any Subject                        )
        any Comments                       ) for which the field body is
        any extension message header field ) defined as '*text'
        any MIME body part field           )
        any X-header

I am inclined to believe Interpretation A, because if I had wanted
Interpretation B, I would have written "... any extension message header
field or any MIME body part field, for which the field body is defined
as '*text'". Of course, the difference only arises in the case of
extension message header fields.
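
To make that last remark concrete, the two readings can be written down
as a small predicate. This is only a sketch of the two interpretations
as described above, not anything taken from RFC 2047 itself; the
function name, the field-kind labels and the body_is_text flag are all
invented for the illustration.

```python
def allowed_by_rule_5_1(kind, body_is_text, interpretation):
    """Return True if Rule 5(1) would permit an encoded-word in the field.

    kind           -- hypothetical label: "subject", "comments",
                      "extension", "mime" or "x"
    body_is_text   -- True if the field body is defined as '*text'
    interpretation -- "A" or "B", as described in the text
    """
    if interpretation not in ("A", "B"):
        raise ValueError("unknown interpretation: %r" % (interpretation,))
    if kind == "x":
        # The second sentence of 5(1) allows any user-defined field.
        return True
    if kind in ("subject", "comments"):
        # RFC 822 defines both as '*text', so the qualifier is always met.
        return True
    if kind == "mime":
        # Both readings attach the '*text' qualifier to MIME body part fields.
        return body_is_text
    if kind == "extension":
        # The only place where the two readings part company.
        return True if interpretation == "A" else body_is_text
    raise ValueError("unknown field kind: %r" % (kind,))


KINDS = ("subject", "comments", "extension", "mime", "x")
differences = [
    (kind, body_is_text)
    for kind in KINDS
    for body_is_text in (True, False)
    if allowed_by_rule_5_1(kind, body_is_text, "A")
    != allowed_by_rule_5_1(kind, body_is_text, "B")
]
print(differences)  # [('extension', False)]
```

The single entry in the list is the case of an extension field whose
body is not defined as '*text'.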


Now suppose I am writing a standards-track document and I want to
introduce a new header-field. Under RFC 822, it would be regarded as an
"extension-field" (under 2822 it would be an "optional-field").

Let us take a specific example from Usefor (note that any news article
is potentially also an email message, because it may have been
posted-and-mailed, or it may be en route to a moderator). So I can
write:

   Mail-Copies-To: Claus Färber <claus(_at_)faerber(_dot_)muc(_dot_)de>

In Usefor, that is a structured header with a pretty obvious syntax in
which "Claus Färber" is clearly a 'phrase'. In the email version (even
if not in the news version also) that has to be encoded as:

   Mail-Copies-To: =?ISO-8859-1?Q?Claus_F=E4rber?= <claus(_at_)faerber(_dot_)muc(_dot_)de>

Note that a news user agent has some semantic duties to perform when it
sees that header, but all an email user agent is expected to do is not
to munge or delete it, and to enable it to be displayed to the user (at
least if the user asks to see it).
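
For illustration, the encoding step for the 'phrase' above can be
sketched in Python. This is a minimal sketch and not a complete RFC 2047
encoder: the name q_encode_phrase is hypothetical, it always uses the
"Q" encoding, and it makes no attempt to enforce the 75-character limit
on an encoded-word. The set of characters left unencoded follows the
restriction that Rule 5(3) places on encoded-words within a 'phrase'.

```python
import string

# Characters that may appear literally in a Q-encoded word inside a
# 'phrase': ASCII letters, digits and "!", "*", "+", "-", "/".
SAFE_IN_PHRASE = set(
    (string.ascii_letters + string.digits + "!*+-/").encode("ascii")
)


def q_encode_phrase(text, charset="ISO-8859-1"):
    """Return text as a single Q-encoded encoded-word (hypothetical helper)."""
    encoded = []
    for byte in text.encode(charset):
        if byte == 0x20:
            encoded.append("_")           # space is written as underscore
        elif byte in SAFE_IN_PHRASE:
            encoded.append(chr(byte))     # safe character, kept as-is
        else:
            encoded.append("=%02X" % byte)  # everything else as =XX
    return "=?%s?Q?%s?=" % (charset, "".join(encoded))


print(q_encode_phrase("Claus F\u00e4rber"))
# =?ISO-8859-1?Q?Claus_F=E4rber?=
```

Python's standard email.header module offers comparable helpers; the
hand-written version is used here only so that every step is visible.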

Q: Is an email message containing that header-field (or should I say the
user agent which permitted it to be sent as an email) RFC
2047-compliant?

A: Under Interpretation A, Yes. Because it is an extension-field which
   satisfies the requirements of Rule 5(1).
   
   Under Interpretation B, No. Because the field body is not defined as
   '*text'.

   However, even with Interpretation B, it might get by under Rule 5(3)
   because, under the Usefor syntax, it is within a 'phrase'.

   OTOH, both those views of Interpretation B seem to presuppose that the
   user agent was familiar with the syntax of Usefor. But maybe it was
   just a simple (non-Usefor-aware) MUA and the user had inserted that
   header manually (as is sometimes done by users emailing directly to a
   moderator).  So we would have the seemingly absurd situation that one
   user agent would be compliant when sending that message, but another
   user agent which sent that same message would be non-compliant.

Here is another example from Usefor:

   Organization: Färber Fabrik

That is, of course, an unstructured header, and would be encoded as

   Organization: =?ISO-8859-1?Q?F=E4rber_Fabrik?=

Q: Is that one RFC 2047-compliant?

A: Yes, under both Interpretations A and B (though under B one might
   wonder how the user agent was supposed to know that it was
   unstructured).

It should be noted in passing that Rule 5(1) contains no requirement for
an encoded-word to be preceded (or followed) by 'linear white space',
although section 7 does seem to enforce such a requirement.



2. Where are encoded-words required to be recognized (RFC 2047 section 6)?
--------------------------------------------------------------------------

One would expect section 6 to require the recognition of anything that
was allowed to appear under section 5, but that seems not to be the case
because there is no mention of "extension message header fields".

In 6.1 I find:

   A mail reader must parse the message and body part headers according
   to the rules in RFC 822 to correctly recognize 'encoded-word's.

Again, I see two interpretations:

Interpretation C:

The wording "rules in RFC 822" means that only the headers explicitly
defined in RFC 822 are required to be examined for the presence of
'encoded-word's.

Interpretation D:

The wording "rules in RFC 822" includes the rules for 'extension-field'
and 'user-defined-field'. The RFC 822 syntax for 'extension-field' is
     extension-field =
                   <Any field which is defined in a document
                    published as a formal extension to this
                    specification; none will have names beginning
                    with the string "X-">
Hence "must parse the message" means that the rules in the document
defining the extension are to be applied.

Ad Interpretation C:-

The text I quoted above refers to "message and body part headers". But
since RFC 822 does not define any body part headers, Interpretation C
would not permit any body part header field to be examined (body part
header fields are introduced in section 5.1 of RFC 2046, and are all
supposed to be of the form "Content-*"). For example, you could not
recognize an encoded-word in a "Content-Description", let alone in
"Organization" or "Mail-Copies-To". So I cannot see how Interpretation C
could have been the intended one.

Ad Interpretation D:-

OTOH, Interpretation D seems to require that all user agents be
magically aware of all new extension headers as soon as their defining
documents are published.

Moreover, you cannot have an agent which attempts to recognize anything
that just happens to look like an encoded-word because it needs to know,
at the very least, whether some unknown header is "unstructured" or not
(i.e. is defined as '*text'). For example,
"(=?ISO-8859-1?Q?Claus_F=E4rber?=)" can occur and should be recognised
in a structured field (if it is in a context where a comment would be
allowed), but it cannot occur in an unstructured field and should
therefore be displayed in its un-decoded form (as is explained in the
examples in section 8 of RFC 2047).
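
The dependence on context can be shown with a small sketch. It assumes
a deliberately simplified decoder: the function names are hypothetical,
only single-level comments are handled, and no attempt is made to parse
a full structured field. The point is merely that the same characters
are decoded in one context and left alone in the other.

```python
import base64
import re

ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=")


def decode_word(token):
    """Decode token if it is exactly one encoded-word, else return None."""
    match = ENCODED_WORD.fullmatch(token)
    if match is None:
        return None
    charset, encoding, payload = match.groups()
    try:
        if encoding.upper() == "Q":
            text = payload.replace("_", " ")
            text = re.sub(r"=([0-9A-Fa-f]{2})",
                          lambda m: chr(int(m.group(1), 16)), text)
            raw = text.encode("latin-1")   # code points 0-255 back to bytes
        else:
            raw = base64.b64decode(payload)
        return raw.decode(charset)
    except (LookupError, ValueError):
        return None                        # unknown charset or bad payload


def decode_unstructured(value):
    """Decode only whitespace-delimited encoded-words, as in '*text'."""
    tokens = re.split(r"(\s+)", value)
    decoded = [decode_word(tok) for tok in tokens]
    out = []
    for i, tok in enumerate(tokens):
        if decoded[i] is not None:
            out.append(decoded[i])
        elif (tok.isspace() and 0 < i < len(tokens) - 1
              and decoded[i - 1] is not None and decoded[i + 1] is not None):
            continue                       # drop space between encoded-words
        else:
            out.append(tok)
    return "".join(out)


def decode_comment(comment):
    """Decode inside one non-nested comment, given with its parentheses."""
    return "(" + decode_unstructured(comment[1:-1]) + ")"


sample = "(=?ISO-8859-1?Q?Claus_F=E4rber?=)"
print(decode_unstructured(sample))  # (=?ISO-8859-1?Q?Claus_F=E4rber?=)
print(decode_comment(sample))       # (Claus Färber)
```

An agent that does not know whether an unknown header is structured has
no way of choosing between those two functions.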

Thus the best interpretation I can place on section 6 is that a
compliant mail reader MUST recognize and decode 'encoded-word's that
occur in the headers explicitly defined in RFC 2822, and that it
MAY/SHOULD/MUST/SOMETHING-ELSE recognize all 'encoded-word's produced by
a compliant agent (as in section 5).



3. And finally ...........
--------------------------

The questions I am unsure about:

1. Which of my "Interpretations" is correct, or are there other possible
Interpretations that I have missed?

2. If someone is writing a standards-track document (whether for news or
email) and wishes to introduce some new header-fields that can make use
of RFC 2047, what does he have to say? Clearly, he defines syntax that
shows whether those fields are unstructured or not, and that introduces
'phrase's and 'comment's in the proper manner. But does he have to
include a remark to the effect that "RFC 822 (or RFC 2822) is hereby
augmented to include these new header-fields, and RFC 2047 is to be
construed accordingly"?

3. Is it possible to go further and introduce header-fields with
explicit 'encoded-word's in them, for example:

    user-agent-header = "User-Agent" ":" 1*( product ["/" token] )
                         ; OK, it also needs to show where WS goes
    product = token / quoted-string / encoded-word





Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl(_at_)clw(_dot_)cs(_dot_)man(_dot_)ac(_dot_)uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5