Re: 2822 revised grammar


On Sun July 24 2005 02:11, Frank Ellermann wrote:


Bruce Lilly wrote:

word            =       atom / quoted-string

It's still not an abbreviation.


In syntax it is, <word> is an "abbreviation" for
<atom> or <quoted-string>.


No sale.

CFWS = *([FWS] comment) (([FWS] comment) / FWS)

your idea: "Read <word> as <atom> or <quoted-string>".

I never said that.


 <200507212140(_dot_)37793(_at_)mail(_dot_)blilly(_dot_)com>
| If you read CFWS as "comments and/or folding
| whitespace" there's no problem.


That's in prose, and I didn't say "zero or more instances of
FWS followed by a comment and followed by either FWS and a comment
or FWS alone" or anything like that; I explained how the abbreviation
is used in prose, in a manner consistent with use throughout the
document including the Appendix B statement of the rule.

 [revised grammar]

as mentioned, there are issues related to phrase that
affect encoded-words.


Sorry, you lost me there, is this about <specials> vs.
<tspecials> and similar nits ?  Or in other words, is
it possible to fix it in a 2231bis / 2047bis based on
2822, or is it something that has to be done together
with 2822bis ?  Do you have an example ?


2047, currently based on 822, says:

   Among the symbols defined in RFC 822 and
   referenced in this memo are: 'addr-spec', 'atom', 'CHAR', 'comment',
   'CTLs', 'ctext', 'linear-white-space', 'phrase', 'quoted-pair'.
   'quoted-string', 'SPACE', and 'word'.  Successful implementation of
   this protocol extension requires careful attention to the RFC 822
   definitions of these terms. 

A revision taking 2822 or its successor into account would have
similar dependencies. As an example, 2047 also says:

   For example, an 'encoded-word' in a
   'phrase' preceding an address in a From header field may not contain
   any of the "specials" defined in RFC 822.

2822 defines phrase as:

word            =       atom / quoted-string

phrase          =       1*word / obs-phrase

quoted-string   =       [CFWS]
                        DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                        [CFWS]

atom            =       [CFWS] 1*atext [CFWS]

That means that the restriction on encoded-words "in a 'phrase'", if
applied to 2822, would include encoded-words in comments, since the
comments (as part of CFWS) are *in* the phrase (unlike 822, where comments
were not specified in the phrase syntax, but could be freely inserted in
many places).  That is quite a difficult problem to solve; the revised
grammar attempts a solution by clearly defining the different types of
encoded-words (in phrases, in comments, and in unstructured fields) and
where they may appear.  It would still require careful rewording of a
2047 successor.  Because of the interdependence, such issues need to be
considered when the 2822 successor is produced.

It's the most important feature of MIME that you can
completely ignore it whenever it pleases you, without
affecting the message format or transport (modulo some
8bit issues).  Have you found any bug where that won't
work ?


RFC 2047:

   While there is no limit to the length of a multiple-line header
   field, each line of a header field that contains one or more
   'encoded-word's is limited to 76 characters.

Unfolding, refolding, and other modifications need to take that into
account.  It's not a "bug"; it's merely a fact that MIME is widely used
and cannot be ignored.
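
To make the refolding concern concrete, here is a minimal sketch of a check
that software modifying a folded header field might run.  The function name
overlong_lines and the regular expression are my own illustration (the
pattern is only a rough approximation of the 2047 encoded-word syntax), not
anything taken from either RFC.

```python
import re

# Rough approximation of =?charset?encoding?encoded-text?= (RFC 2047 section 2).
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[^?\s]+\?[^?\s]+\?=")

MAX_LINE = 76  # per-line limit quoted above from RFC 2047


def overlong_lines(folded_field):
    """Return indexes of physical lines that hold an encoded-word and exceed 76 chars.

    `folded_field` is a raw header field whose folds are CRLF followed by
    whitespace.  Lines without an encoded-word are not subject to this limit.
    """
    bad = []
    for index, line in enumerate(folded_field.split("\r\n")):
        if ENCODED_WORD.search(line) and len(line) > MAX_LINE:
            bad.append(index)
    return bad
```

A refolding routine could call this after rewriting a field and fold again
at a different point if the result is non-empty.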

The problem is that the 2822 definitions of phrase,
unstructured, and comment have to be consistent with
what 2047 uses.


Yes, they have to be consistent, e.g. if I know CFWS
and treat it as "semantically invisible" I'd want the
same result with or without knowing 2047 / 2231.


See above w.r.t. "in a phrase".

For <unstructured> I've no idea how that could be ever
a "problem", as you said.


RFC 2047 again:

(1) An 'encoded-word' may replace a 'text' token (as defined by RFC 822)
    in any Subject or Comments header field, any extension message
    header field, or any MIME body part field for which the field body
    is defined as '*text'.  An 'encoded-word' may also appear in any
    user-defined ("X-") message or body part header field.

    Ordinary ASCII text and 'encoded-word's may appear together in the
    same header field.  However, an 'encoded-word' that appears in a
    header field defined as '*text' MUST be separated from any adjacent
    'encoded-word' or 'text' by 'linear-white-space'.

That's the entire section 5 text regarding unstructured field use of
encoded-words.  Note that it refers to 'text' but says nothing about
specials.  It is unclear whether or not
   Subject:=?us?q?foo_bar?=
is legal (the encoded-word is adjacent to the colon, which isn't part
of the field body, and isn't field body 'text', although colon is a valid
character in the 'text' grammar rule).  With 2047 and 822, there's also
a question about line folding, which differs from "linear-white-space".
It's unclear whether
   Subject:foo
    =?us?q?bar_baz?=
is legal, because the encoded word is separated from the text "foo" not
by linear-white-space but by line folding.
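
The two cases can be told apart mechanically, which is what a clearer
specification would have to rule on.  The following is only a sketch: the
helper name left_separators and its labels are hypothetical, and it takes
no position on which cases are legal; it merely reports what precedes each
encoded-word in the field body.

```python
import re

# Rough approximation of the RFC 2047 encoded-word syntax.
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[^?\s]+\?[^?\s]+\?=")


def left_separators(raw_field):
    """Classify what precedes each encoded-word in a raw header field.

    Labels (all hypothetical): "start-of-body" (directly after the colon),
    "none" (adjacent to other text), "folding" (the gap contains CRLF),
    "space-or-tab" (the gap is only spaces and tabs).
    """
    _name, _colon, body = raw_field.partition(":")
    results = []
    for match in ENCODED_WORD.finditer(body):
        before = body[:match.start()]
        stripped = before.rstrip(" \t\r\n")
        gap = before[len(stripped):]
        if not before:
            kind = "start-of-body"
        elif not gap:
            kind = "none"
        elif "\r\n" in gap:
            kind = "folding"
        else:
            kind = "space-or-tab"
        results.append((match.group(0), kind))
    return results
```

Applied to the two Subject examples above, the first is reported as
"start-of-body" and the second as "folding"; those are exactly the cases
the 2047 text leaves open.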

These issues need to be considered when the message format syntax is specified
so that the necessary encoded-word usage rules can be properly (clearly,
concisely, and unambiguously) specified without requiring yet another
revision to the message format specification.

But since you insisted on 
allowing the obs-phrase dots in USEFOR I fear that it
could bite "us" (= USEFOR).  And if it does it's also
a problem in 2822 or 2047.


Not a big problem. Dot is a special and is subject to the rules regarding
specials and encoded-words in a phrase.  The 2822 obs-phrase rule affects
an unquoted dot in a phrase.  Currently, dot is not allowed in a Q-encoded
encoded-word; that could conceivably be revised to allow it e.g. adjacent
to an encoded accented initial.  Otherwise it would be treated per 2047
and 2822 rules, e.g.
   From: Egbert =?iso-8859-1?q?_=C9?= . Egglesthwaite
would match 2822 obs-phrase and 2047 rules; quoting the dot would be fully
conforming for generation.
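
For readers who want to see what that encoded-word carries, here is a small
sketch.  The character set is the one RFC 2047 section 5 paragraph (3)
permits in a Q-encoded encoded-word within a phrase; the function names
q_text_ok_in_phrase and decode_q_word are hypothetical, and the decoder
handles only well-formed Q encoding.

```python
import re
import string

# Characters RFC 2047 section 5 (3) permits in Q-encoded text inside a phrase.
PHRASE_Q_SAFE = set(string.ascii_letters + string.digits + "!*+-/=_")


def q_text_ok_in_phrase(encoded_text):
    """True if every character of the encoded-text is in the phrase-safe set."""
    return all(ch in PHRASE_Q_SAFE for ch in encoded_text)


def decode_q_word(word):
    """Decode one well-formed Q-encoded encoded-word to a string."""
    match = re.fullmatch(r"=\?([^?]+)\?[qQ]\?([^?]*)\?=", word)
    if match is None:
        raise ValueError("not a Q-encoded encoded-word")
    charset, text = match.group(1), match.group(2)
    raw = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "_":            # underscore stands for a space
            raw.append(0x20)
            i += 1
        elif ch == "=":          # =XX is a hex-encoded octet
            raw.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:                    # any other character stands for itself
            raw.append(ord(ch))
            i += 1
    return raw.decode(charset)
```

The encoded-text "_=C9" passes the phrase check and decodes to a space
followed by a capital E with acute accent; appending a literal dot to the
encoded-text would fail the check, which is the restriction discussed above.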

(=?us-ascii?Q?example?=?=)  Something's odd there.

See RFC 2047 section 5, paragraphs labeled "(2)" (and
refer to the errata for 2047).


Okay, no "(", ")", "\", and separated by LWSP from ctext
or any other encoded-word.

Of course a literal '?' can't appear in the encoded-text
(2047 sect. 4.2 "(3)").  However, B and Q encodings
differ, and there may be private-use or future encodings
with still other characteristics.


Yes, but no "?" is a _general_ rule for <encoded-text> in
chapter 2, it does not depend on the encoding.


?!?  There is no way to have a question mark in B-encoded encoded-text
as that is not one of the 64 characters used in B encoding.  It is
absolutely related to the encoding, which is why special mention is made
for Q encoding but not for B encoding (where it cannot happen).
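
That is easy to confirm.  A minimal sketch using Python's standard base64
module follows; the names B_ALPHABET, b_encoded_text and
uses_only_b_alphabet are mine, chosen for illustration.

```python
import base64
import string

# The 64 characters of the base64 alphabet used by B encoding, plus padding.
B_ALPHABET = set(string.ascii_uppercase + string.ascii_lowercase
                 + string.digits + "+/")
PAD = "="


def b_encoded_text(data):
    """Return the base64 text that B encoding would place in an encoded-word."""
    return base64.b64encode(data).decode("ascii")


def uses_only_b_alphabet(text):
    """True if `text` consists solely of base64 alphabet characters and padding."""
    return all(ch in B_ALPHABET or ch == PAD for ch in text)
```

Even when the input bytes themselves contain question marks, the B-encoded
output stays within the 64-character alphabet plus "=" padding, so a "?"
cannot appear in it.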

And your 
cchar allows "?" (%d63):

    cchar =  %d33-39 /       ; Printable US-ASCII
             %d42-91 /       ;  characters not including "(",
             %d93-126        ;  ")", or "\"

While the ABNF could be tweaked somewhat, there is a limit
to what can be achieved with ABNF


Just exclude "?" in addition to "(", ")", "\".   Bye, Frank


No, because cchar is also used for comment content, and '?' is
perfectly legal in a comment.  At minimum, a separate rule would
have to be formulated.  There would be a temptation to have not
one, but two rules because the characters are different for Q and B
encoding, but then there is no provision for private-use or future
encodings.  So yes, it *could* be done, but there would have to be
some consensus about whether it's worthwhile going to that level of
detail for something already covered in the encoding rules and in
consideration of the liabilities w.r.t. private-use and future
encodings.
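
For what it's worth, here is a sketch of what the simplest such separate
rule would amount to.  The name EW_COMMENT_CHAR is purely hypothetical and
appears in no draft; the ranges for cchar are the ones quoted above.  The
sketch removes "?" from cchar and prints the resulting ABNF numeric ranges.

```python
def char_range(lo, hi):
    """Set of characters with code points lo..hi inclusive."""
    return {chr(code) for code in range(lo, hi + 1)}


# cchar as quoted above: printable US-ASCII except "(", ")" and "\".
CCHAR = char_range(33, 39) | char_range(42, 91) | char_range(93, 126)

# Hypothetical separate rule: cchar without "?" (%d63).
EW_COMMENT_CHAR = CCHAR - {"?"}


def abnf_ranges(chars):
    """Render a character set as a list of ABNF decimal range terms."""
    codes = sorted(ord(ch) for ch in chars)
    ranges = []
    start = prev = codes[0]
    for code in codes[1:]:
        if code != prev + 1:      # gap found: close the current run
            ranges.append((start, prev))
            start = code
        prev = code
    ranges.append((start, prev))
    return ["%d{}-{}".format(a, b) if a != b else "%d{}".format(a)
            for a, b in ranges]


print(" / ".join(abnf_ranges(EW_COMMENT_CHAR)))
```

That covers only the general "no question mark" restriction; it says
nothing about the differing B and Q alphabets or about private-use and
future encodings, which is the liability mentioned above.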