Why the 822bis grammar is so painful

1999-02-06 18:50:51
RFC 822 says that structured field bodies are parsed as follows:

   sequence of characters
   (e.g., "6 Feb 1999")

     | tokenize

   spaces, tabs, comments,
   "<", ">", ",", ";", ":", "@", ".", atoms, domain literals, quoted strings
   (e.g., atom 6, space, atom Feb, space, atom 1999)

     | remove spaces, tabs, comments

   "<", ">", ",", ";", ":", "@", ".", atoms, domain literals, quoted strings
   (e.g., atom 6, atom Feb, atom 1999)

     | parse

   higher-level data

It's easy to give precise English descriptions of each of these steps.
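
The three steps are also easy to write down as code. What follows is a
minimal sketch in Python, not anything taken from RFC 822 or from the
DRUMS drafts: the names tokenize, strip_noise, ATOM_STOP and the
(kind, text) token shape are all made up for illustration, and the
field body is assumed to be already unfolded (no CR or LF).

```python
SINGLE = set('<>,;:@.')            # the one-character tokens listed above
ATOM_STOP = set('()<>@,;:\\".[]')  # RFC 822 specials: an atom stops before these


def _skip_comment(body, i):
    """Return the index just past the comment opening at body[i]; comments nest."""
    depth = 0
    j = i
    while j < len(body):
        c = body[j]
        if c == '\\':              # quoted-pair: skip the escaped character
            j += 2
            continue
        if c == '(':
            depth += 1
        elif c == ')':
            depth -= 1
            if depth == 0:
                return j + 1
        j += 1
    raise ValueError('unterminated comment')


def _skip_delimited(body, i, close):
    """Return the index just past a quoted string or domain literal opening at body[i]."""
    j = i + 1
    while j < len(body):
        if body[j] == '\\':        # quoted-pair
            j += 2
            continue
        if body[j] == close:
            return j + 1
        j += 1
    raise ValueError('unterminated ' + close)


def tokenize(body):
    """Step 1: turn a sequence of characters into a list of (kind, text) tokens."""
    tokens = []
    i = 0
    while i < len(body):
        c = body[i]
        if c == ' ' or c == '\t':
            tokens.append(('space' if c == ' ' else 'tab', c))
            i += 1
        elif c == '(':
            j = _skip_comment(body, i)
            tokens.append(('comment', body[i:j]))
            i = j
        elif c == '"':
            j = _skip_delimited(body, i, '"')
            tokens.append(('quoted-string', body[i:j]))
            i = j
        elif c == '[':
            j = _skip_delimited(body, i, ']')
            tokens.append(('domain-literal', body[i:j]))
            i = j
        elif c in SINGLE:
            tokens.append(('special', c))
            i += 1
        else:
            # Atom: read as many characters as possible, stopping before
            # the first special, space, or control character.
            j = i
            while (j < len(body) and body[j] not in ATOM_STOP
                   and body[j] > ' ' and body[j] != '\x7f'):
                j += 1
            if j == i:
                raise ValueError('unexpected character %r' % c)
            tokens.append(('atom', body[i:j]))
            i = j
    return tokens


def strip_noise(tokens):
    """Step 2: remove spaces, tabs, comments."""
    return [t for t in tokens if t[0] not in ('space', 'tab', 'comment')]
```

With these definitions, strip_noise(tokenize('6 Feb 1999')) gives
[('atom', '6'), ('atom', 'Feb'), ('atom', '1999')], matching the example
above, and tokenize('foobar') can only ever give one atom. Step 3, the
parse into higher-level data, then works on that token list and never
has to think about whitespace or comments.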

Pete Resnick, over the objections of several implementors on DRUMS,
threw away the RFC 822 tokenizer. He wrote a new ABNF grammar that
starts from sequences of characters, rather than sequences of tokens.
ABNF is a weak programming language in which simple lexing steps such as

   Read as many characters as possible, stopping before the first ...

and

   Now remove all comments

are a royal pain to handle correctly, so it's hardly a surprise that
Resnick made some big mistakes in his grammar, and that the result is
much more difficult to read than RFC 822.

Charles Lindsey writes:
  [ ``foobar'' being parsed as two atoms ]
  However, I gather Pete Resnick has spotted this discussion and is
  taking it up on the DRUMS list. I hope they fix it.

I raised the same issue on the DRUMS mailing list in 1996.

At the time, Resnick was claiming that English was error-prone while ABNF was not:
``Having everything in the grammar leaves no ambiguity, and having them
in the prose is almost guaranteeing it.''

I pointed out that the evidence was against him: ``Really? How come your
grammar allows `To: anything I want'? How come your grammar allows the
string `foo' to be parsed as three atoms?'' Of course, these ambiguities
are extremely difficult to eliminate from the formal grammar.
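
To see where the extra parses come from, take a deliberately simplified
character-level rule: a phrase is one or more atoms, an atom is one or
more atom characters, and the whitespace between atoms is optional. That
rule is an assumption made for illustration, not the text of Resnick's
draft, and atom_splits is a hypothetical helper name. The sketch below
lists every way such a rule can derive a given string.

```python
def atom_splits(s):
    """Return every way to divide s into a sequence of non-empty atoms."""
    if not s:
        return [[]]                # one way to split nothing: no atoms
    results = []
    for k in range(1, len(s) + 1):
        # Take s[:k] as the first atom, then split the rest every possible way.
        for rest in atom_splits(s[k:]):
            results.append([s[:k]] + rest)
    return results


for parse in atom_splits('foo'):
    print(parse)
# prints:
# ['f', 'o', 'o']
# ['f', 'oo']
# ['fo', 'o']
# ['foo']
```

A tokenizer that reads as many characters as possible keeps only the
last of these. Getting the same effect inside the grammar means every
rule that can follow an atom has to forbid another atom character at
that point, which is the part that is so hard to get right.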

Resnick's response: ``I don't think it makes a difference, unless you do
something silly like try to feed tokens you've received and parsed into
something that's going to put comments or whitespace between them.''

I said that an incorrect spec was unacceptable, and suggested writing
the grammar in C instead of ABNF. Resnick didn't respond.