ietf-822

Why the 822bis grammar is so painful

1999-02-06 18:50:51
RFC 822 says that structured field bodies are parsed as follows:

   characters
   (e.g., "6 Feb 1999")

     |
     | tokenize
     V

   spaces, tabs, comments,
   "<", ">", ",", ";", ":", "@", ".", atoms, domain literals, quoted strings
   (e.g., atom 6, space, atom Feb, space, atom 1999)

     |
     | remove spaces, tabs, comments
     V

   "<", ">", ",", ";", ":", "@", ".", atoms, domain literals, quoted strings
   (e.g., atom 6, atom Feb, atom 1999)

     |
     | parse
     V

   higher-level data

It's easy to give precise English descriptions of each of these steps.
See http://pobox.com/~djb/proto/immhf.html.
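
To make the first two steps concrete, here is a minimal sketch in Python.
It is an illustration only, not code from RFC 822 or from the page cited
above. The names tokenize and strip_noise are hypothetical, the input is
assumed to be already unfolded, and the specials set is the one listed
in RFC 822 section 3.3.

```python
# Sketch of the RFC 822 lexing steps for a structured field body.
# Assumptions: input is already unfolded; function names are illustrative.

SPECIALS = set('()<>@,;:\\".[]')


def _is_atom_char(c):
    # Atoms exclude specials, space, and control characters.
    return c not in SPECIALS and " " < c < "\x7f"


def _scan_comment(s, i):
    # s[i] is "("; return the index just past the matching ")".
    # Comments nest, and a backslash quotes the next character.
    depth = 0
    j = i
    while j < len(s):
        if s[j] == "\\":
            j += 2
            continue
        if s[j] == "(":
            depth += 1
        elif s[j] == ")":
            depth -= 1
            if depth == 0:
                return j + 1
        j += 1
    raise ValueError("unterminated comment")


def _scan_delimited(s, i, close):
    # s[i] is the opening delimiter; return the index just past `close`.
    j = i + 1
    while j < len(s):
        if s[j] == "\\":
            j += 2
            continue
        if s[j] == close:
            return j + 1
        j += 1
    raise ValueError("unterminated " + close)


def tokenize(body):
    # Step 1: characters -> (kind, text) tokens, including spaces and comments.
    tokens = []
    i = 0
    while i < len(body):
        c = body[i]
        if c in " \t":
            j, kind = i + 1, "space"
        elif c == "(":
            j, kind = _scan_comment(body, i), "comment"
        elif c == '"':
            j, kind = _scan_delimited(body, i, '"'), "quoted-string"
        elif c == "[":
            j, kind = _scan_delimited(body, i, "]"), "domain-literal"
        elif c in SPECIALS:
            j, kind = i + 1, "special"
        elif _is_atom_char(c):
            # Read as many atom characters as possible.
            j, kind = i + 1, "atom"
            while j < len(body) and _is_atom_char(body[j]):
                j += 1
        else:
            raise ValueError("unexpected character %r" % c)
        tokens.append((kind, body[i:j]))
        i = j
    return tokens


def strip_noise(tokens):
    # Step 2: remove spaces, tabs, comments.
    return [t for t in tokens if t[0] not in ("space", "comment")]
```

With this sketch, tokenize("6 Feb 1999") gives atom 6, space, atom Feb,
space, atom 1999, and strip_noise reduces that to the three atoms, which
is what the parse step then works on.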

Pete Resnick, over the objections of several implementors on DRUMS,
threw away the RFC 822 tokenizer. He wrote a new ABNF grammar that
starts from sequences of characters, rather than sequences of tokens.
ABNF is a weak programming language in which simple lexing steps such as

   Read as many characters as possible, stopping before the first ...

and

   Now remove all comments

are a royal pain to handle correctly, so it's hardly a surprise that
Resnick made some big mistakes in his grammar, and that the result is
much more difficult to read than RFC 822.

Charles Lindsey writes:
  [ ``foobar'' being parsed as two atoms ]
> However, I gather Pete Resnick has spotted this discussion and is
> taking it up on the DRUMS list. I hope they fix it.

I raised the same issue on the DRUMS mailing list in 1996.

Resnick was claiming that English was error-prone while ABNF was not:
``Having everything in the grammar leaves no ambiguity, and having them
in the prose is almost guaranteeing it.''

I pointed out that the evidence was against him: ``Really? How come your
grammar allows `To: anything I want'? How come your grammar allows the
string `foo' to be parsed as three atoms?'' Of course, these ambiguities
are extremely difficult to eliminate from the formal grammar.
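
A small Python sketch of the `foo' problem, under a loudly stated
assumption: it models a simplified rule in which a phrase is one or more
atoms, an atom is one or more atom characters, and whitespace between
atoms is optional. That is a stand-in for illustration, not Resnick's
actual grammar, and atom_splits is a hypothetical helper name.

```python
def atom_splits(text):
    # Every way to cut `text` into consecutive non-empty atoms when the
    # grammar has no "read as many characters as possible" rule.
    if not text:
        return [[]]
    splits = []
    for cut in range(1, len(text) + 1):
        for rest in atom_splits(text[cut:]):
            splits.append([text[:cut]] + rest)
    return splits


def longest_match(text):
    # The tokenizer-style answer: one atom, as long as possible.
    return [text]


for parse in atom_splits("foo"):
    print(parse)
```

Under that assumed rule the string has four derivations, including the
three-atom one; a tokenizer that reads as many characters as possible
accepts only the single atom foo.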

Resnick's response: ``I don't think it makes a difference, unless you do
something silly like try to feed tokens you've received and parsed into
something that's going to put comments or whitespace between them.''

I said that an incorrect spec was unacceptable, and suggested writing
the grammar in C instead of ABNF. Resnick didn't respond.

---Dan