Re: Revisiting RFC 2822 grammar (obs-utext and unstructured)


Charles Lindsey wrote:

In <401AFF6C(_dot_)4030806(_at_)verizon(_dot_)net> Bruce Lilly 
<blilly(_at_)verizon(_dot_)net> writes:

I think that the following does the right thing (obs-utext can be empty
or it can start or end with any ASCII octet and can have any sequence
except CRLF, which is handled in unstructured by FWS, and unstructured
can be completely empty or it can begin or end with utext or FWS, but
any instance of utext is separated from any other instance by FWS):

obs-utext       =       *(*obs-char (*LF / (*CR 1*obs-char))) *CR



So, if X is some random obs-char, "XCR" is an obs-utext, and "LF" is an
obs-utext. Therefore "XCRLF" is an unstructured. Q.N.E.D.
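
For anyone who wants to check the objection mechanically, here is a
rough sketch that transliterates the obs-utext rule quoted above into a
Python regular expression. The names OBS_CHAR, OBS_UTEXT, is_obs_utext
and check_prose_description are made up for this illustration, and the
obs-char range is taken from RFC 2822 (any ASCII character except CR
and LF). It only tests single obs-utext strings; the concatenation
problem arises because a rule such as *utext lets two of them sit side
by side.

```python
import re
from itertools import product

# obs-char per RFC 2822: any ASCII character except CR and LF.
OBS_CHAR = r"[\x00-\x09\x0b\x0c\x0e-\x7f]"

# Transliteration of:
#   obs-utext = *(*obs-char (*LF / (*CR 1*obs-char))) *CR
OBS_UTEXT = re.compile(rf"(?:{OBS_CHAR}*(?:\n*|\r*{OBS_CHAR}+))*\r*")


def is_obs_utext(s):
    """Return True if the whole string matches the obs-utext rule."""
    return OBS_UTEXT.fullmatch(s) is not None


def check_prose_description(max_len=5):
    """Compare the rule with the prose reading 'any sequence except CRLF'
    on every string of X, CR and LF up to max_len characters."""
    for n in range(max_len + 1):
        for chars in product("X\r\n", repeat=n):
            s = "".join(chars)
            if is_obs_utext(s) != ("\r\n" not in s):
                return False
    return True


# The two pieces of the counterexample, and their concatenation.
pieces_ok = is_obs_utext("X\r") and is_obs_utext("\n")
joined_ok = is_obs_utext("X\r\n")
```

Under these assumptions pieces_ok should come out true and joined_ok
false: each half is a legal obs-utext on its own, and the CRLF only
appears once two instances are allowed to be adjacent.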

unstructured    =       *(utext FWS) *utext



No, that is not right with either your subsequent fix or with Pete's
subsequent fix, because you allow two FWS adjacent, and Pete does not
allow an unstructured consisting of FWS and nothing else.


OK, there's a problem with 2822; it needs an obs-unstructured as well
as unstructured (see section 4 introductory paragraphs, specifically
the reference to 3.2.3).  Here's one way to tie everything together:

text as currently defined in 2822, with the comment extended to note that
     NUL is also excluded

obs-char as currently defined in 2822, i.e. any ASCII character except
    CR and LF

unstructured = *(text [FWS])
   assuming unstructured fields are defined as in my revised grammar, e.g.
   comments = "Comments" ":" [FWS] unstructured CRLF
   (see discussion below)
   optionally one could define
   utext = *(text [FWS])
   and then define unstructured as utext, but what would be the point...

obs-utext either as defined in 2822 or as above, i.e. empty, can start
   or end with obs-char, CR, or LF, but can't have CRLF pair

obs-unstructured = *(obs-utext FWS) [obs-utext]
   i.e. cannot have two adjacent instances of obs-utext strings (must
   have FWS separator), may have multiple adjacent FWS instances (since
   obs-utext may be empty, and in order to comply with the section 4
   normative text regarding parsing of WS-only continuation lines), may
   be empty, may begin or end with any obs-utext string or with FWS,
   any CRLF pair is followed by WS (as part of FWS)

with obs- forms of unstructured fields as in

obs-comments = "Comments" *WSP ":" obs-unstructured CRLF
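
As a sanity check on the rules listed above, here is a hedged Python
sketch of both forms. Everything named here (WSP, FWS, TEXT, OBS_UTEXT,
is_unstructured, is_obs_unstructured) is invented for the illustration.
It assumes FWS as in RFC 2822 section 3.2.3 including obs-FWS, text
without its obs-text alternative and with NUL excluded, and obs-utext in
its prose form (any ASCII sequence with no CRLF pair). The patterns are
ambiguous because WSP is also a text character, so this is suitable for
short test strings only, not as a production parser.

```python
import re

WSP = r"[ \t]"

# FWS = ([*WSP CRLF] 1*WSP) / obs-FWS, with obs-FWS = 1*WSP *(CRLF 1*WSP)
FWS = rf"(?:(?:{WSP}*\r\n)?{WSP}+|{WSP}+(?:\r\n{WSP}+)*)"

# text: ASCII 1-127 except CR and LF (obs-text alternative left out here).
TEXT = r"[\x01-\x09\x0b\x0c\x0e-\x7f]"

# obs-utext, prose form: obs-char, LF, or a CR not followed by LF.
OBS_UTEXT = r"(?:[\x00-\x09\x0b\x0c\x0e-\x7f]|\n|\r(?!\n))*"

# unstructured = *(text [FWS])
UNSTRUCTURED = re.compile(rf"(?:{TEXT}(?:{FWS})?)*")

# obs-unstructured = *(obs-utext FWS) [obs-utext]
OBS_UNSTRUCTURED = re.compile(rf"(?:{OBS_UTEXT}{FWS})*(?:{OBS_UTEXT})?")


def is_unstructured(s):
    """True if s matches the non-obs unstructured rule in full."""
    return UNSTRUCTURED.fullmatch(s) is not None


def is_obs_unstructured(s):
    """True if s matches the obs-unstructured rule in full."""
    return OBS_UNSTRUCTURED.fullmatch(s) is not None


samples = ["", " ", " foo", "foo\r\n bar", "X\r\n", "foo\r\n \r\n bar"]
results = {s: (is_unstructured(s), is_obs_unstructured(s)) for s in samples}
```

With these definitions the empty string and a body consisting of FWS
alone are accepted by the obs- form, "X" CR LF with nothing after it is
rejected by both forms, and a WS-only continuation line is accepted by
the obs- form. A body that starts with a fold is not matched by the
non-obs rule on its own, which is why the field definition carries the
explicit [FWS] after the colon.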

Discussion: my revised grammar starts unstructured field bodies with
[FWS] after the colon, partly as a result of handling Subject hacks
and partly to be consistent in grouping (possibly optional) FWS/CFWS
to the right of another token to enable unambiguous LR(1) parsing.
The obs- forms of unstructured fields could be similarly defined,
however [FWS] after the colon would be redundant since obs-unstructured
may begin with FWS.  A formal field definition consistent with the
non-obs fields w.r.t. explicit [FWS] after the colon still might be
desirable, but the redundancy should be noted in a comment in the ABNF
or in nearby text.

Other considerations: "unstructured" might be referred to by extension
RFCs.  Are there any problems with the definition above (e.g. w.r.t.
a possible RFC 2047 successor, which currently refers to unstructured
fields as being defined as (RFC 822) *text)?

Note that an unstructured field body begins with [FWS], explicitly at
least in the non-obs cases.  Therefore, in
   Subject: foo
the field body is " foo", not "foo", and
   Subject: Re: foo
begins with " Re:", not with "Re:", so the wording of section 3.6.5
should be revised (or the field name/field body delimiter formally
redefined to include any [FWS] or [CFWS] (as the case may be) following
the colon). [And A.2 needs to warn about line length limits, including
those in effect when encoded-words are present; "prepending" is
inadequate as a means of implementation.]
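
To make the Subject example concrete, here is a minimal Python sketch
that splits a header line at the first colon and nothing more. The
helper name split_field is hypothetical, and the sketch ignores
unfolding and the obs- case in which WSP may precede the colon.

```python
def split_field(line):
    """Split a header line at the first colon.

    The returned body keeps whatever follows the colon, including any
    leading white space, because the colon alone is the delimiter.
    """
    name, sep, body = line.partition(":")
    if not sep:
        raise ValueError("not a header field: no colon found")
    return name, body


name, body = split_field("Subject: Re: foo")

# A literal test for "Re:" at the start of the body fails ...
starts_with_re = body.startswith("Re:")

# ... unless the leading white space is skipped first.
starts_with_re_after_ws = body.lstrip(" \t").startswith("Re:")
```

Here body is " Re: foo" with the leading space, so the first test is
false and the second is true, which is the mismatch with the current
wording of section 3.6.5.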