Re: Revisiting RFC 2822 grammar (obs-utext and unstructured)


Bruce Lilly wrote:

obs-text only appears in text (where obs-char makes more sense) and
obs-utext (which is defined as obs-text). obs-utext appears only in utext,
and utext appears only in unstructured.

I'll take another look at the unstructured and related productions.


OK, here's another attempt.  Working backwards from unstructured fields:

comments         =   "Comments" ":" unstructured CRLF
obs-comments     =   "Comments" *WSP ":" [FWS] obs-unstructured CRLF
etc.

N.B. there is no [FWS] after the colon in the non-obs fields.  Under
the non-obs rules an unstructured field cannot be allowed to end with
FWS, which it would do if [FWS] were present and the "unstructured"
production were empty.  That would allow CRLF 1*WSP CRLF, i.e. a
whitespace-only "continuation" line at the end of the field, which is
prohibited by sect. 3.2.3.  Per contra, that must be permitted in the
obs- unstructured fields.  That is why separate unstructured and
obs-unstructured productions are required.

"unstructured" must therefore permit beginning with FWS, but only
if there is content after the FWS. It must permit an empty instance
so that the unstructured field body may be empty.  Since there is
[FWS] in the obs- unstructured fields, and we don't want an explicit
case of two adjacent instances of FWS (CRLF 1*WSP CRLF being provided
for by obs-FWS), a separate obs-unstructured is required, and it must
NOT begin with FWS, but may end with FWS.

unstructured     =   [utext] *(FWS utext)

That doesn't group FWS with a token to its left. It could be rewritten
to do so for the repetition, but would still need provision for leading
FWS (as noted above):

unstructured     =   [[FWS] *(utext FWS) utext]

obs-unstructured =   *(utext FWS) [utext]
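
As a sanity check on the three productions above, here is a rough
token-level sketch in Python.  It assumes (purely for illustration) that
each utext instance is abstracted to the letter "u" and each FWS instance
to the letter "F", so it says nothing about the octets inside utext or
FWS.  The regular expressions and helper names are mine, not part of any
draft.

```python
import re
from itertools import product

# Token-level abstraction (an assumption for illustration only):
# "u" stands for one utext instance, "F" for one FWS instance.
UNSTRUCTURED_A = re.compile(r"u?(?:Fu)*")        # [utext] *(FWS utext)
UNSTRUCTURED_B = re.compile(r"(?:F?(?:uF)*u)?")  # [[FWS] *(utext FWS) utext]
OBS_UNSTRUCTURED = re.compile(r"(?:uF)*u?")      # *(utext FWS) [utext]


def token_strings(max_len):
    """Yield every string over {u, F} of length 0..max_len."""
    for n in range(max_len + 1):
        for combo in product("uF", repeat=n):
            yield "".join(combo)


def accepted(pattern, max_len=8):
    """Return the set of token strings the pattern matches in full."""
    return {s for s in token_strings(max_len) if pattern.fullmatch(s)}


a = accepted(UNSTRUCTURED_A)
b = accepted(UNSTRUCTURED_B)
obs = accepted(OBS_UNSTRUCTURED)

print("both forms of unstructured agree:", a == b)
print("unstructured can end with FWS:", any(s.endswith("F") for s in a))
print("obs-unstructured can begin with FWS:",
      any(s.startswith("F") for s in obs))
print("adjacent FWS possible:", any("FF" in s for s in a | obs))
```

The check is exhaustive only up to eight tokens, so it is evidence rather
than proof, but it covers the cases discussed here: empty bodies, a
leading FWS in unstructured, and a trailing FWS in obs-unstructured.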

To avoid multiple adjacent FWS, and to avoid ending with FWS as mentioned
above, utext must be non-empty. Because any two instances of utext must
be separated by FWS, utext may begin with LF and/or end with CR with no
danger of an unintended CRLF followed by non-WSP appearing. utext must
permit any non-zero number of characters, with any sequence except CRLF
(Charles, you may suggest a different name if you like, but I'll use utext
for consistency with 2822). The obs- form of utext must permit any
US-ASCII octet.  obs-char includes all US-ASCII octets except LF and CR.
So, it's not too difficult to come up with a definition of utext given
those characteristics and constraints:

utext            =  1*text / obs-utext
obs-utext        =  (CR / 1*(LF / (*CR 1*obs-char))) *CR

text will be as in 2822 except obs-char instead of obs-text, and
with the comment noting exclusion of NUL from non-obs text.  There
is no need for an obs-text production (obs-char could be renamed).
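
For anyone who wants to test the obs-utext production mechanically, here
is a small Python sketch.  It assumes obs-char is as in 2822 (%d0-9 /
%d11 / %d12 / %d14-127), transcribes the production into a regular
expression, and compares it by brute force against the plain-language
requirement "non-empty and containing no CRLF".  The function names and
the four-character test alphabet are my own choices for illustration.

```python
import re
from itertools import product

# obs-char as in RFC 2822: any US-ASCII octet except LF (10) and CR (13).
OBS_CHAR = r"[\x00-\x09\x0b\x0c\x0e-\x7f]"

# obs-utext = (CR / 1*(LF / (*CR 1*obs-char))) *CR
OBS_UTEXT = re.compile(r"(?:\r|(?:\n|\r*" + OBS_CHAR + r"+)+)\r*")


def matches_obs_utext(s):
    """True if the whole string is derivable from obs-utext."""
    return OBS_UTEXT.fullmatch(s) is not None


def is_nonempty_without_crlf(s):
    """The stated requirement: at least one octet, and no CRLF pair."""
    return s != "" and "\r\n" not in s


def find_disagreements(alphabet="\r\na ", max_len=6):
    """List short strings where the grammar and the requirement differ."""
    bad = []
    for n in range(max_len + 1):
        for combo in product(alphabet, repeat=n):
            s = "".join(combo)
            if matches_obs_utext(s) != is_nonempty_without_crlf(s):
                bad.append(s)
    return bad


print(find_disagreements())
```

An empty list means the production and the requirement agree on every
string of up to six characters drawn from CR, LF, "a" and SP.  The
nested repetition in the pattern can backtrack badly on long inputs, so
this is meant as a checking aid rather than as a parser.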

Going back over the required characteristics:
utext must be non-empty. Check.
utext must permit any nonzero number of characters in any sequence
  except CRLF. The first octet can be CR, LF, or obs-char (that's
  every possibility).  The last octet can be CR, LF, or obs-char. In
  between, one can have any number of obs-char, any number of LF, and
  any number of CR provided the last CR of each run is followed by
  something other than LF (i.e. by at least one obs-char).  Check.
The non-obs variety of utext consists of one or more text characters
  (excluding the obs-char alternative), which allows any US-ASCII octet
  except CR, LF, and NUL.  Seems OK to me.
The "unstructured" production can be empty or can begin with utext or FWS.
  If it is not empty (e.g. if it begins with FWS), it must have some utext.
  Any utext may be followed by FWS and more utext. And so on, ending with
  utext (unless of course unstructured is completely empty).
obs-unstructured can be empty or it can begin with utext. If there is
  additional utext, there must be FWS between the utext instances. It
  may end with FWS or with utext. It cannot begin with FWS.

Can anybody see any problems with that?