Re: Interpretation of RFC 2047

[I originally sent this last Thursday, but omitted to include a Subject.
That may have caused several people not to see it, so here it is again.

It is sent primarily to the ietf-822 list (which is where Reply-To
is set) because I am anxious to hear the views of the email experts.
However, it is CCed for information to the Usefor list.]

I am having problems understanding RFC 2047. In particular, I want to
establish what a user agent would be REQUIRED to do in order to claim
minimal compliance with RFC 2047 (yes, I know actual implementations are
likely to be more 'liberal' than that, but I am after a fundamental
baseline which other documents can *rely* on).

The problem splits into two parts:

1. Where can encoded-words legitimately appear (RFC 2047 section 5).
2. Where are encoded-words required to be recognized (RFC 2047 section 6).


1. Where can encoded-words legitimately appear (RFC 2047 section 5).
-------------------------------------------------------------------

In 5 (1) I find:

   An 'encoded-word' may replace a 'text' token (as defined by RFC 822)
   in any Subject or Comments header field, any extension message header
   field, or any MIME body part field for which the field body is
   defined as '*text'. An 'encoded-word' may also appear in any
   user-defined ("X-") message or body part header field.

That is ambiguous, depending on how you interpret the commas in the first
sentence:

Interpretation A:

It means you can use an encoded-word in
      any Subject
      any Comments
      any extension message header field
      any MIME body part field for which the field body is defined as '*text'
      any X-header

Interpretation B:

It means you can use an encoded-word in
      any Subject                        )
      any Comments                       ) for which the field body is
      any extension message header field ) defined as '*text'
      any MIME body part field           )
      any X-header


Interpretation B is intended.

In Usefor, that is a structured header with a pretty obvious syntax in
which "Claus F\xE4rber" is clearly a 'phrase'. In the email version (even
if not in the news version also) that has to be encoded as:

   Mail-Copies-To: =?ISO-8859-1?Q?Claus_F=E4rber?=
    <claus(_at_)faerber(_dot_)muc(_dot_)de>


only if that portion of the field is defined as 'text' (which would be
difficult because 'text' allows '<' so it wouldn't be recognized as a
terminator).  OTOH, if that portion of the mail-copies-to field isn't
defined as 'text' you still have the burden of encoding it in ASCII somehow,
since octet 0xE4 isn't valid in an email message header.  so it would
be left to a different spec to define how 2047 applies to mail-copies-to.
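
As a rough illustration of the encoding step discussed here, the following
Python sketch builds a single 'Q' encoded-word for text that replaces a word
in a 'phrase'. The function name q_encode_word and the PHRASE_SAFE set are
made up for this example; the set of characters left unencoded follows my
reading of RFC 2047 section 5(3), and the sketch does not attempt to split
long text into several encoded-words.

```python
# Characters that may stay unencoded in a 'Q' encoded-word that replaces
# a word in a 'phrase' (assumption: taken from RFC 2047 section 5(3)).
PHRASE_SAFE = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789!*+-/"
)


def q_encode_word(text, charset="ISO-8859-1"):
    """Return text as one RFC 2047 'Q' encoded-word (hypothetical helper)."""
    pieces = []
    for byte in text.encode(charset):
        if byte == 0x20:
            pieces.append("_")            # space is written as underscore
        elif byte in PHRASE_SAFE:
            pieces.append(chr(byte))      # safe ASCII is copied through
        else:
            pieces.append("=%02X" % byte) # everything else becomes =XX
    word = "=?%s?Q?%s?=" % (charset, "".join(pieces))
    if len(word) > 75:
        # RFC 2047 section 2 limits an encoded-word to 75 characters.
        raise ValueError("encoded-word too long; split the text first")
    return word


print(q_encode_word("Claus F\xe4rber"))  # =?ISO-8859-1?Q?Claus_F=E4rber?=
```

Whether a given field may carry the result is still up to the spec that
defines that field; the sketch only shows the mechanical part.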

Q: Is an email message containing that header-field (or should I say the
user agent which permitted it to be sent as an email) RFC
2047-compliant?


not sure if you mean before or after 2047 encoding.   2047 doesn't
forbid use in other structured fields, but neither does it require
use in structured fields that aren't listed in 2047.

   OTOH, both those views of Interpretation B seem to presuppose that the
   user agent was familiar with the syntax of Usefor.


no, it presupposes that when gateways take things that aren't email
messages (this includes usenet messages) and send them via email,
they make them compliant with email message standards at that time.
gateways have to track standards on both sides.  that's life.

But maybe it was
   just a simple (non-Usefor-aware) MUA and the user had inserted that
   header manually (as is sometimes done by users emailing directly to a
   moderator).  So we would have the seemingly absurd situation that one
   user agent would be compliant when sending that message, but another
   user agent which sent that same message would be non-compliant.


it's not clear to what degree user agents are expected to prevent users
from generating invalid messages, but this isn't a 2047 issue.

Here is another example from Usefor:

   Organization: F\xE4rber Fabrik

That is, of course, an unstructured header, and would be encoded as

   Organization: =?ISO-8859-1?Q?F=E4rber_Fabrik?=

Q: Is that one RFC 2047-compliant?


I'd say probably so.  though I don't know if it's formally defined anywhere,
organization is quite naturally 'text'.
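
For an unstructured ('text') field like this, a reader can decode the whole
field body without parsing it any further. A minimal sketch using Python's
standard email.header module; the helper name decode_unstructured is
invented for the example, and the field body is an encoded form of the
Organization value above:

```python
from email.header import decode_header, make_header


def decode_unstructured(field_body):
    """Decode any encoded-words in an unstructured field body."""
    # decode_header() splits the body into (bytes-or-str, charset) pairs;
    # make_header() reassembles them and str() yields the decoded text.
    return str(make_header(decode_header(field_body)))


decoded = decode_unstructured("=?ISO-8859-1?Q?F=E4rber_Fabrik?=")
print(decoded == "F\xe4rber Fabrik")  # True
```

A body with no encoded-words comes back unchanged.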

It should be noted in passing that Rule 5(1) contains no requirement for
an encoded-word to be preceded (or followed) by 'linear white space',
although section 7 does seem to enforce such a requirement.


the rule is different for structured and unstructured (text) fields.

2. Where are encoded-words required to be recognized (RFC 2047 section 6).
--------------------------------------------------------------------------

One would expect section 6 to require the recognition of anything that
was allowed to appear under section 5, but that seems not to be the case
because there is no mention of "extension message header fields".

In 6.1 I find:

   A mail reader must parse the message and body part headers according
   to the rules in RFC 822 to correctly recognize 'encoded-word's.


this is intended to apply to structured fields - the point is that
for structured fields you actually have to parse them to distinguish
between places where encoded words are valid (e.g. a word in a
phrase) and places where they are not valid (e.g. a word in a local-part
or a domain).
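
To make the "parse first, then decode" point concrete, here is a small
sketch for a mailbox-style field. It splits the field body into phrase and
addr-spec with email.utils.parseaddr and decodes only the phrase, leaving
the address alone. The names decode_phrase and parse_mailbox and the
example.invalid address are placeholders for this example, not anything
defined by 2047 or Usefor.

```python
from email.header import decode_header
from email.utils import parseaddr


def decode_phrase(phrase):
    """Decode encoded-words found in a phrase (hypothetical helper)."""
    parts = []
    for chunk, charset in decode_header(phrase):
        if isinstance(chunk, bytes):
            parts.append(chunk.decode(charset or "ascii"))
        else:
            parts.append(chunk)  # no encoded-words: already a str
    return "".join(parts)


def parse_mailbox(field_body):
    """Return (decoded phrase, addr-spec) for a single mailbox."""
    phrase, addr_spec = parseaddr(field_body)
    # Only the phrase is run through the 2047 decoder; the local-part
    # and domain are returned exactly as parsed.
    return decode_phrase(phrase), addr_spec


name, addr = parse_mailbox(
    "=?ISO-8859-1?Q?Claus_F=E4rber?= <claus@example.invalid>"
)
print(addr)  # claus@example.invalid
```

The order matters: decoding the raw field body before parsing could turn
encoded text into characters such as '<' or ',' that the parser would then
misread as syntax.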

2. If someone is writing a standards-track document (whether for news or
email) and wishes to introduce some new header-fields that can make use
of RFC 2047, what does he have to say?


just say which elements are to be encoded and interpreted according to 2047.

3. Is it possible to go further and introduce header-fields with
explicit 'encoded-word's in them, for example:


yes.  I don't exactly like doing that (there's a lot of overlap between
'token' and 'encoded-word', for instance) but it's a widespread practice.

Keith

p.s. granted 2047 could use some clarification.