ietf-822

Re: Revisiting RFC 2822 grammar

2004-01-15 20:12:36

In <3FF7A5FC(_dot_)9080804(_at_)verizon(_dot_)net> 
blilly(_at_)verizon(_dot_)net writes:

Pete Resnick wrote:

I notice that the ABNF in there has a few things that are non-2822 
such as encoded words. Do you have a copy of the ABNF which is purely 
the 2822 replacement that we can post to the ietf-2822 list?

I have an older version which does not have the encoded-word grammar 
(full text below).

Following is the text of the modified grammar w/o encoded-word grammar,
interspersed with some notes:

rfc2822grammar_simplified.txt version 0.13 2001/08/08 16:02:35  
excerpted from RFC 2822 and modified by Bruce Lilly


quoted-pair     =       ("\" text)
[N.B. had redundant obs-qp alternative]

I think not. The obs version allows \NUL, \CR and \LF, which the regular
version does not.

[N.B. RFC 822 ASCII NUL not permitted, even with obs- rules]

Bruce gives many examples of differences from RFC 822. I will leave it to
others to comment on the rights and wrongs, but some of them certainly
look like bugs in RFC 2822 to me.

atom            =       1*atext [CFWS]

And here start the changes to remove ambiguities from the grammar.
Essentially, we want to get rid of cases where the grammar can produce two
CFWSs side by side (because it makes it harder for the human reader to
puzzle it out, and it makes it harder to derive automatic parsers, whether
LR(1) or otherwise).

When I first encountered this syntax, I convinced myself that it would be
more trouble than it was worth, and hugely increase the size of the
grammar, to remove these cases. Now Bruce has unconvinced me.

Essentially, he arranges that all the relevant constructs can be followed
by a CFWS (usually optional) but never preceded by one. So he has removed
around a dozen occurrences of [CFWS] from the syntax, and inserted around
30 new occurrences. AFAICS his revisions all work, and are worthwhile.
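
To see why two CFWSs side by side are a problem, it helps to count parses.
The following is a minimal Python sketch, not anything taken from Bruce's
grammar. It assumes the run between two tokens is plain spaces (so each
optional CFWS reduces to 1*SP), and the name count_parses is made up for
illustration. It counts the ways a run of n spaces can be shared out among
adjacent optional whitespace slots.

```python
def count_parses(n_spaces: int, slots: int) -> int:
    """Count the ways to divide a run of n_spaces among `slots` adjacent
    optional whitespace slots. A slot that receives zero spaces is simply
    absent; a slot that receives k > 0 spaces matches 1*SP."""
    if slots == 0:
        # Nothing left to absorb spaces: valid only if none remain.
        return 1 if n_spaces == 0 else 0
    # Give k spaces to the first slot, then divide the rest recursively.
    return sum(count_parses(n_spaces - k, slots - 1)
               for k in range(n_spaces + 1))


if __name__ == "__main__":
    # One trailing [CFWS]: the run can only be read one way.
    print(count_parses(3, 1))  # 1
    # Two [CFWS] side by side: the same run has four readings.
    print(count_parses(3, 2))  # 4
```

With a single slot there is exactly one reading of any run; with two slots
side by side the count grows with the length of the run, which is what the
"followed by, never preceded by" arrangement avoids.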

I have a few niggles, and there are still some other ambiguities which he has
not touched, which I will come to later.

date-time       =       ([ day-name "," [FWS]] date FWS time [CFWS]) /
                        obs-date-time

All the obs- versions of month, day, etc are gone, and are replaced by a
single rule for obs-date-time. On the face of it, this is a good move, but
it leads to problems later on, as we shall see.

zone            =       ( "+" / "-" ) 4DIGIT
[N.B. no CFWS between +- and 4DIGIT]

Indeed. Are you saying that such was allowed in RFC 822?

message         =       (fields / obs-fields) [CRLF body]

But this rule leads to horrendous ambiguities, with no prospect of
avoiding them in less than 50 pages of syntax :-( . But I shall defer
discussion of that till later, because there are other issues with it.

subject         =       "Subject:" [FWS] [("cmsg" / "Re: ") [FWS]]
                        unstructured CRLF
[ RFC 1036 sect. 2.2.6 "cmsg" Subject hack, sect. 2.1.4 "Re: " ]

AAARRRRRGGGGGGGGGHHHHHHHHHHH!

Please, no "Re: " or "cmsg" in the syntax. Quite apart from the introduced
ambiguity (the 'unstructured' could begin with those things anyway), the
"Re: " convention is better described by verbiage in the semantics. We
have removed it from the syntax of Usefor (and Bruce was one of the people
who urged that). It remains as a semantic convention with wording with
much the same effect as the wording currently in RFC 2822 (and even that
is not cast in concrete yet). As for "cmsg", that convention is no longer
implemented anywhere AFAIK; Usefor says it MUST NOT be used for any
semantic effect, though it still recommends putting it there for old
time's sake when a Control header is present (though I am now dubious
about even that).

path            =       ("<" [CFWS] [addr-spec] ">" [CFWS]) / obs-path

I think it would be better to say

path            = angle-addr / "<" [CFWS] ">" [CFWS]

That way you avoid the need for obs-path (obs-angle-addr takes care of
it).


Now we come to the obs- syntax, where there are still many ambiguities. As
things stand, sometimes the obs- syntax allows something that is already
in the regular syntax (that is ambiguous). OTOH, sometimes it does not,
and sometimes it allows only a part of what is in the regular syntax, all
of which can be very confusing to the reader who tries to work out exactly
how the obs- syntax differs from the regular.

It would be much better for each bit of obs- syntax to produce only the
extra bits which are not already in the regular stuff, and as I shall show
this is quite easily done.

obs-qp          =       "\" (%d0-127)
[N.B. unnecessary]

But if you do keep it, all it needs is:
obs-qp          = "\" ( NUL / LF / CR )

obs-text        =       %d0-127

should be:
obs-text        = NUL / LF / CR

obs-char        =       %d0-9 / %d11 /          ; %d0-127 except CR and
                       %d12 / %d14-127         ;  LF


It would be clearer to say:
obs-char        =       utext / NUL / WSP
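
The two character-set claims above can be checked with plain set
arithmetic. This is a small Python sketch; the helper name codes is made
up, and the definitions of text, NO-WS-CTL, utext and WSP are taken from
RFC 2822 with their obs- alternatives left out.

```python
def codes(*ranges):
    """Build a set of character codes from inclusive (lo, hi) ranges."""
    out = set()
    for lo, hi in ranges:
        out.update(range(lo, hi + 1))
    return out


ALL_ASCII = codes((0, 127))                        # %d0-127
TEXT = codes((1, 9), (11, 12), (14, 127))          # regular 'text'
NO_WS_CTL = codes((1, 8), (11, 12), (14, 31), (127, 127))
UTEXT = NO_WS_CTL | codes((33, 126))               # regular 'utext'
WSP = {9, 32}                                      # HTAB, SP
NUL, LF, CR = 0, 10, 13

# What %d0-127 adds on top of the regular 'text'.
extra_text = ALL_ASCII - TEXT

# obs-char as written: %d0-9 / %d11 / %d12 / %d14-127.
OBS_CHAR = codes((0, 9), (11, 12), (14, 127))

if __name__ == "__main__":
    print(sorted(extra_text))                 # [0, 10, 13]
    print(UTEXT | {NUL} | WSP == OBS_CHAR)    # True
```

So the only codes that obs-text (and obs-qp) need to add are NUL, LF and
CR, and utext / NUL / WSP covers exactly the same codes as the numeric
form of obs-char.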

obs-utext       =       *LF *CR *(obs-char *LF *CR)
[N.B. was obs-text]

No, that does not work because it allows CRLF not followed by WSP in the
middle of an 'unstructured'. I think the only way out of that is to
rewrite the rule for 'unstructured':

unstructured    =       *(utext [FWS]) obs-ltext
obs-utext       =       (1*LF *CR / 1*CR) obs-char / NUL
obs-ltext       =       *LF *CR

obs-phrase      =       word *(word / ("." [CFWS]))

OK, but that is not "obsolete". It is intended as an extension to be
allowed sometime in the future on a "MUST accept, SHOULD NOT generate yet"
basis. So please can we rename it as 'extended-phrase' (which is what I
have currently put in Usefor).

obs-phrase-list =       phrase / (1*([phrase] "," [CFWS]) [phrase])

To be truly "obs-", that needs to contain at least one occurrence of two
","s with no phrase between them. A syntax to achieve this would be:

obs-phrase-list =       phrase *("," [CFWS] phrase)
                          2*("," [CFWS])
                          *(1*("," [CFWS]) phrase)

obs-FWS         =       1*WSP *(CRLF 1*WSP)

should be:
obs-FWS         =       2*(*WSP CRLF 1*WSP)
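
A quick way to check that this form of obs-FWS no longer overlaps the
regular one is to turn both into regular expressions. The following Python
sketch assumes the regular rule is RFC 2822's FWS = ([*WSP CRLF] 1*WSP)
without its obs- alternative; the function name classify is made up for
illustration.

```python
import re

WSP = r"[ \t]"
CRLF = r"\r\n"

# Regular: [*WSP CRLF] 1*WSP  (at most one fold)
FWS = re.compile(rf"(?:{WSP}*{CRLF})?{WSP}+")
# Suggested obs form: 2*(*WSP CRLF 1*WSP)  (two or more folds)
OBS_FWS = re.compile(rf"(?:{WSP}*{CRLF}{WSP}+){{2,}}")


def classify(s: str) -> str:
    """Report which of the two rules matches the whole string."""
    regular = FWS.fullmatch(s) is not None
    obs = OBS_FWS.fullmatch(s) is not None
    if regular and obs:
        return "both"
    if regular:
        return "regular"
    if obs:
        return "obs"
    return "neither"


if __name__ == "__main__":
    print(classify(" "))               # regular
    print(classify(" \r\n\t"))         # regular
    print(classify("\r\n \r\n "))      # obs
    print(classify(" \r\n \r\n\t"))    # obs
```

None of these samples comes out as "both", which is the point of the
change: a single fold stays regular, and only a run of two or more folds
is left to the obs- rule.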

obs-date-time   =       [ day-name [CFWS] "," [CFWS]] obs-date [CFWS]
                        FWS [CFWS] obs-time [CFWS]

Essentially, if all the CFWS are, in fact, FWS, then it is regular;
otherwise, it is obs-. But it would be a pain to write it all as one rule
that way. However, if you were to restore the separate obs-month, obs-day,
etc as in RFC 2822, I think it could be done quite easily.

obs-angle-addr  =       "<" [CFWS] [obs-route] addr-spec ">" [CFWS]

should be:
obs-angle-addr  =       "<" [CFWS] obs-route addr-spec ">" [CFWS]

obs-local-part  =       word *("." [CFWS] word)

That is a tricky one. Essentially, a phrase can consist of a collection of
atoms and quoted-strings separated by ("." [CFWS]). To be considered
"obs", it has to contain, somewhere, either a genuine CFWS, or an
(atom "." quoted-string), or a (quoted-string "." atom), or a
(quoted-string "." quoted-string). If it does not have one of those
somewhere, it is regular. Here is the syntax to do it:

obs-local-part  =       *(word ".") word "." CFWS word *("." [CFWS] word) /
                        *(atom ".") atom "." quoted-string *("." word) /
                        *(quoted-string ".") quoted-string "." atom *("." word) /
                        1*(quoted-string ".") quoted-string

obs-domain      =       atom *("." [CFWS] atom)

should be:
obs-domain      =       dot-atom 1*("." CFWS dot-atom)

obs-mbox-list   =       1*([mailbox] "," [CFWS]) [mailbox]

Needs same treatment as obs-phrase-list.

obs-addr-list   =       1*([address] "," [CFWS]) [address]

Needs same treatment as obs-phrase-list.

obs-fields      =       *(obs-return / obs-received / obs-orig-date /
                        obs-from / obs-sender / obs-reply-to / obs-to /
                        obs-cc / obs-bcc / obs-message-id /
                        obs-in-reply-to / obs-references / obs-subject /
                        obs-comments / obs-keywords / obs-resent-date /
                        obs-resent-from / obs-resent-send /
                        obs-resent-rply / obs-resent-to / obs-resent-cc /
                        obs-resent-bcc / obs-resent-mid / obs-optional)

obs-orig-date   =       "Date" *WSP ":" [CFWS] date-time CRLF

If you retain the ambiguous syntax for 'message' then this rule, and the
following ones like it, are fine. But if you do away with that syntax (see
below), then they would need to be changed to

obs-orig-date   =       "Date" 1*WSP ":" [CFWS] date-time CRLF

in order to disambiguate them from their regular counterparts. Also, some
of them which include explicit obs- syntax would need some attention.
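
The difference between *WSP and 1*WSP here can be shown with a couple of
regular expressions. This Python sketch only looks at the field name and
the colon; the function name matching_rules is made up, and case is
ignored because ABNF literals are case-insensitive.

```python
import re

REGULAR = re.compile(r"Date:", re.IGNORECASE)          # "Date:"
LOOSE_OBS = re.compile(r"Date[ \t]*:", re.IGNORECASE)   # "Date" *WSP ":"
STRICT_OBS = re.compile(r"Date[ \t]+:", re.IGNORECASE)  # "Date" 1*WSP ":"


def matching_rules(line: str, obs_rule: "re.Pattern[str]") -> list:
    """List which of the regular rule and the given obs rule match the
    start of a header line."""
    hits = []
    if REGULAR.match(line):
        hits.append("regular")
    if obs_rule.match(line):
        hits.append("obs")
    return hits


if __name__ == "__main__":
    plain = "Date: Mon, 5 Jan 2004 12:54:02 -0000"
    spaced = "Date : Mon, 5 Jan 2004 12:54:02 -0000"
    print(matching_rules(plain, LOOSE_OBS))    # ['regular', 'obs']
    print(matching_rules(plain, STRICT_OBS))   # ['regular']
    print(matching_rules(spaced, STRICT_OBS))  # ['obs']
```

With *WSP an ordinary "Date:" line satisfies both rules; with 1*WSP each
line satisfies at most one of them.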

obs-subject     =       "Subject" *WSP ":" [FWS] [("cmsg" / "Re:")
                        [FWS]] unstructured CRLF
[ RFC 1036 sect. 2.2.6 "cmsg" hack, 2.1.4 "Re:" (w/ or w/o space) ]

But please, no "Re: " or "cmsg".

obs-path        =       obs-angle-addr

Not needed if my suggestion above for 'path' is adopted.



Now back to the message grammar:

message         =       (fields / obs-fields) [CRLF body]

That is trying to kill two birds with one stone:
  1) to force some order into the regular headers;
  2) to allow WSP before the ":", and also a few extra obs- features

The problem is that everything that turns up in fields also turns up in
obs-fields. Hence it is ambiguous (and I doubt it is LR(1) either). And it
would be totally unfixable by writing further syntax (except as a
theoretical possibility). So the alternatives are
  a) Leave it ambiguous, and admit that it is so, or
  b) Enforce the ordering of the headers by verbiage, rather than
     syntactic means.

However, before doing either of those, please reconsider whether the
ordering you are trying to enforce is too rigid. Currently, RFC 2822
requires:

1. Return-Path
2. 1*Received
3. *Resent-xxx
4. Other headers

Yes, it is a good idea that tracing headers be added at the top, so you
can tell the order in which the message passed through various agents, but
there are some useful cases which have been excluded, for example:

Received: from D by E
Received: from C by D
Resent-To: bar(_at_)E
Resent-From: foo(_at_)C
Received: from B by C
Received: from A by B

IOW, why forbid keeping a record of how it travelled from its origin to
the place where it was resent?
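
For illustration, here is a small Python sketch that checks a sequence of
header names against the four-step ordering as summarized above. It is
only a model of that summary, not of the full 'fields' rule in RFC 2822;
it does not check how many times each header occurs, and the names rank
and follows_summary_order are made up.

```python
def rank(name: str) -> int:
    """Map a header name to its step in the summarized ordering."""
    n = name.lower()
    if n == "return-path":
        return 1
    if n == "received":
        return 2
    if n.startswith("resent-"):
        return 3
    return 4  # every other header


def follows_summary_order(names) -> bool:
    """True if the steps never go backwards from one header to the next."""
    ranks = [rank(n) for n in names]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


if __name__ == "__main__":
    resent_example = [
        "Received", "Received",
        "Resent-To", "Resent-From",
        "Received", "Received",
    ]
    print(follows_summary_order(resent_example))  # False

    regrouped = [
        "Received", "Received", "Received", "Received",
        "Resent-To", "Resent-From",
    ]
    print(follows_summary_order(regrouped))       # True
```

The example above fails the check only because the later Received headers
sit below the Resent- ones; regrouping them passes, at the cost of losing
the record of where in the journey the resend happened.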

Here is another example (a real one this time, which some readers of
uk.net.news.management may recognize):

Received:  from lon-mail-1.gradwell.net (localhost [127.0.0.1])
 by clerew.man.ac.uk (8.11.7+Sun/8.11.7) with ESMTP id i05HCjF01021
 for <chl(_at_)clerew(_dot_)man(_dot_)ac(_dot_)uk>; Mon, 5 Jan 2004 17:12:45 GMT
Delivered-To:   postmaster(_at_)A       
Received:  (qmail 81124 invoked by uid 800); 5 Jan 2004 12:54:22 -0000  
Delivered-To:   forwarding-chl(_at_)clerew(_dot_)man(_dot_)ac(_dot_)uk  
X-Gradwell-SpamScore:   ssss    
X-Gradwell-SpamScore:   ssss    
X-Gradwell-Mailfilter:  Spam detected by SpamAssassin with 4.0 hits
 (3 required)   
X-Gradwell-Mailfilter:  SpamAssassin hits were PRIORITY_NO_NAME
 RCVD_IN_DYNABLOCK RCVD_IN_SORBS X_PRIORITY_HIGH        
X-Envelope-To:  chl(_at_)clerew(_dot_)man(_dot_)ac(_dot_)uk     
X-Forwarding-To:        chl(_at_)clerew(_dot_)man(_dot_)ac(_dot_)uk     
Received:  (qmail 80864 invoked from network); 5 Jan 2004 12:54:02 -0000
Received:  from newred.gradwell.net (193.111.200.20)
 by lon-mail-1.gradwell.net with SMTP; 5 Jan 2004 12:54:02 -0000        
Received:  (qmail 12659 invoked by uid 1148); 5 Jan 2004 12:54:01 -0000 
Mailing-List:   contact committee-help(_at_)usenet(_dot_)org(_dot_)uk; run by ezmlm
Reply-To:       committee(_at_)usenet(_dot_)org(_dot_)uk        
List-Post:      <mailto:committee(_at_)usenet(_dot_)org(_dot_)uk>       
List-Help:      <mailto:committee-help(_at_)usenet(_dot_)org(_dot_)uk>  
Delivered-To:   mailing list committee(_at_)usenet(_dot_)org(_dot_)uk   
Received:  (qmail 12622 invoked from network); 5 Jan 2004 12:54:00 -0000
Received:  from lon-mail-2.gradwell.net (193.111.201.126)
 by newred.gradwell.net with SMTP; 5 Jan 2004 12:54:00 -0000    
Received:  (qmail 71512 invoked by uid 800); 5 Jan 2004 12:54:00 -0000  
Delivered-To:   forwarding-committee(_at_)usenet(_dot_)org(_dot_)uk     
X-Gradwell-SpamScore:   ssss    
X-Gradwell-Mailfilter:  Not Spam, SpamAssassin hits of 4.0 (5 required) 
Received:  (qmail 71352 invoked from network); 5 Jan 2004 12:53:49 -0000
Received:  from host217-42-124-162.range217-42.btcentralplus.com
 (HELO smtp-relay.vlaad.co.uk) (217.42.124.162) by lon-mail-2.gradwell.net
 with SMTP; 5 Jan 2004 12:53:49 -0000   
Received:  from gst-group.co.uk (localhost [127.0.0.1]) by
 smtp-relay.vlaad.co.uk with SMTP (Mailtraq/2.4.0.1534) id SMTPE9B4633C;
 Mon, 05 Jan 2004 12:53:22 -0000        
Mime-Version:   1.0     
.....

Now there are all sorts of perfectly genuine "tracing headers" in there,
all added in transit, and all useful. Some of them are X-headers (so you
need the concept of an "X-tracing-header"). Some, like "Delivered-To"
probably should have been X-headers. Some of them, like the "List.*" ones
are properly defined by an RFC, just not by RFC 2822. And that "Reply-To"
in the middle was added by the mailing list expander, as its position
indicates.

It is non-conformant with RFC 2822 but, IMO, it ought not to be.

So what RFC 2822bis really needs is some careful discussion of what
tracing headers are and how they are to be added, and not a rigid syntax.


-- 
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: chl(_at_)clerew(_dot_)man(_dot_)ac(_dot_)uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5