ietf-822

Re: Revisiting RFC 2822 grammar

2004-01-03 22:35:55

Pete Resnick wrote:

Hi Bruce,

As I said on the ietf-822 list, I was getting started on updates to RFC 2822 to move it along to draft and was looking at:

<http://users.erols.com/blilly/mparse/rfc2822grammar_simplified.txt>

I notice that the ABNF in there has a few things that are non-2822 such as encoded words. Do you have a copy of the ABNF which is purely the 2822 replacement that we can post to the ietf-2822 list?

pr

I have an older version which does not have the encoded-word grammar (full text below).

The rationale for adding encoded-word grammar is:
a) 822 (as amended by RFC 2047 section 5 and as further amended by RFC 2231) had it,
   though spread out over three documents
b) it is necessary for MIME-conforming implementations
c) the encoded-word rules are rather complex, particularly those regarding
   adjacent linear whitespace -- I believe that the grammar in the current
   document (URI above) covers everything except the rule prohibiting
   encoded-words in Received fields.
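
To give a feel for one of those whitespace rules: RFC 2047 section 5 requires that an encoded-word in unstructured text be separated from adjacent text or encoded-words by linear white space. The Python fragment below is only a rough sketch of that single rule, not the grammar in the document at the URI above. The name delimited_encoded_words is invented for the example, the regular expression is a loose approximation of the encoded-word syntax, and folding is treated as ordinary white space.

```python
import re

# Loose approximation of an RFC 2047 encoded-word: =?charset?B-or-Q?text?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=")

WHITE_SPACE = " \t\r\n"


def delimited_encoded_words(text):
    """Return the encoded-words in text that are set off by white space
    (or by the start/end of the string) on both sides."""
    found = []
    for m in ENCODED_WORD.finditer(text):
        before_ok = m.start() == 0 or text[m.start() - 1] in WHITE_SPACE
        after_ok = m.end() == len(text) or text[m.end()] in WHITE_SPACE
        if before_ok and after_ok:
            found.append(m.group(0))
    return found
```

A candidate that abuts other text, as in "x=?US-ASCII?Q?no?=", is left out of the result; under this rule it would be treated as ordinary text.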

From an implementor's perspective, I'd like to see all of the relevant base grammar (i.e. base field and supporting grammar) in a single document. Indeed, one of the benefits of 2822 is that it consolidated most of the "... amends RFC 822" piecemeal details into a single document (although the 2047/2231 amendments somehow didn't make it into 2822). I don't believe there is any harm in including the encoded-word grammar, as encoded-words appear in the higher-level constructs as alternatives to ccontent, word, and utext.

Following is the text of the modified grammar w/o encoded-word grammar, interspersed with some notes:

rfc2822grammar_simplified.txt version 0.13 2001/08/08 16:02:35 excerpted from RFC 2822 and modified by Bruce Lilly

NO-WS-CTL       =       %d1-8 /         ; US-ASCII control characters
                       %d11 /          ;  that do not include the
                       %d12 /          ;  carriage return, line feed,
                       %d14-31 /       ;  and white space characters
                       %d127

text            =       %d1-9 /         ; Characters excluding CR and LF
                       %d11 / %d12 / %d14-127 / obs-text

specials        =       "(" / ")" /     ; Special characters used in
                       "<" / ">" /     ;  other parts of the syntax
                       "[" / "]" /
                       ":" / ";" /
                       "@" / "\" /
                       "," / "." /
                       DQUOTE

quoted-pair     =       ("\" text)
[N.B. had redundant obs-qp alternative]

FWS             =       ([*WSP CRLF] 1*WSP) /   ; Folding white space
                       obs-FWS

ctext           =       NO-WS-CTL /     ; Non white space controls
                       %d33-39 /       ; The rest of the US-ASCII
                       %d42-91 /       ;  characters not including "(",
                       %d93-126        ;  ")", or "\"
[N.B. RFC 822 ASCII NUL not permitted, even with obs- rules]

ccontent        =       ctext / quoted-pair / comment

comment         =       "(" *([FWS] ccontent) [FWS] ")"

CFWS            =       *([FWS] comment) (([FWS] comment) / FWS)

atext           =       ALPHA / DIGIT / ; Any character except controls,
                       "!" / "#" /     ;  SP, and specials.
                       "$" / "%" /     ;  Used for atoms
                       "&" / "'" /
                       "*" / "+" /
                       "-" / "/" /
                       "=" / "?" /
                       "^" / "_" /
                       "`" / "{" /
                       "|" / "}" /
                       "~"

atom            =       1*atext [CFWS]

dot-atom        =       dot-atom-text [CFWS]

dot-atom-text   =       1*atext *("." 1*atext)

qtext           =       NO-WS-CTL /     ; Non white space controls
                       %d33 /          ; The rest of the US-ASCII
                       %d35-91 /       ;  characters not including "\"
                       %d93-126        ;  or the quote character
[N.B. RFC 822 ASCII NUL not permitted, even with obs- rules]

qcontent        =       qtext / quoted-pair

quoted-string   =       DQUOTE [FWS] *(qcontent [FWS]) DQUOTE [CFWS]

word            =       atom / quoted-string

phrase          =       1*word / obs-phrase

utext           =       NO-WS-CTL /     ; Non white space controls
                       %d33-126 /      ; The rest of US-ASCII
                       obs-utext

unstructured    =       *(utext [FWS])

date-time = ([ day-name "," [FWS]] date FWS time [CFWS]) / obs-date-time

day-name = "Mon" / "Tue" / "Wed" / "Thu" / "Fri" / "Sat" / "Sun"

date            =       day FWS month-name FWS year

year            =       4*DIGIT

month-name = "Jan" / "Feb" / "Mar" / "Apr" / "May" / "Jun" / "Jul" / "Aug" / "Sep" / "Oct" / "Nov" / "Dec"

day             =       1*2DIGIT

time            =       time-of-day FWS zone

time-of-day     =       hour ":" minute [ ":" second ]

hour            =       2DIGIT

minute          =       2DIGIT

second          =       2DIGIT

zone            =       ( "+" / "-" ) 4DIGIT
[N.B. no CFWS between +- and 4DIGIT]

address         =       mailbox / group

mailbox         =       name-addr / addr-spec

name-addr       =       [display-name] angle-addr

angle-addr      =       ("<" [CFWS] addr-spec ">" [CFWS]) / obs-angle-addr

group           =       display-name ":" [CFWS] [mailbox-list] ";" [CFWS]

display-name    =       phrase

mailbox-list    =       (mailbox *("," [CFWS] mailbox)) / obs-mbox-list

address-list    =       (address *("," [CFWS] address)) / obs-addr-list

addr-spec       =       local-part "@" [CFWS] domain

local-part      =       dot-atom / quoted-string / obs-local-part

domain          =       dot-atom / domain-literal / obs-domain

domain-literal  =       "[" [FWS] *(dcontent [FWS]) "]" [CFWS]

dcontent        =       dtext / quoted-pair

dtext           =       NO-WS-CTL /     ; Non white space controls
                       %d33-90 /       ; The rest of the US-ASCII
                       %d94-126        ;  characters not including "[",
                                       ;  "]", or "\"
[N.B. RFC 822 ASCII NUL not permitted, even with obs- rules]

message         =       (fields / obs-fields) [CRLF body]

body            =       *(*998text CRLF) *998text

fields          =       *(trace
                         *(resent-date /
                           resent-from /
                           resent-sender /
                           resent-to /
                           resent-cc /
                           resent-bcc /
                           resent-msg-id))
                       *(orig-date /
                         from /
                         sender /
                         reply-to /
                         to /
                         cc /
                         bcc /
                         message-id /
                         in-reply-to /
                         references /
                         subject /
                         comments /
                         keywords /
                         optional-field)

orig-date       =       "Date:" [FWS] date-time CRLF

from            =       "From:" [CFWS] mailbox-list CRLF

sender          =       "Sender:" [CFWS] mailbox CRLF

reply-to        =       "Reply-To:" [CFWS] address-list CRLF

to              =       "To:" [CFWS] address-list CRLF

cc              =       "Cc:" [CFWS] address-list CRLF

bcc             =       "Bcc:" [CFWS] [address-list] CRLF

message-id      =       "Message-ID:" [CFWS] msg-id CRLF

in-reply-to     =       "In-Reply-To:" [CFWS] 1*msg-id CRLF

references      =       "References:" [CFWS] 1*msg-id CRLF

msg-id          =       ( "<" id-left "@" id-right ">" [CFWS]) / obs-msg-id

id-left         =       dot-atom-text / no-fold-quote

id-right        =       dot-atom-text / no-fold-literal

no-fold-quote   =       DQUOTE *(qtext / quoted-pair) DQUOTE

no-fold-literal =       "[" *(dtext / quoted-pair) "]"

subject = "Subject:" [FWS] [("cmsg" / "Re: ") [FWS]] unstructured CRLF
[ RFC 1036 sect. 2.2.6 "cmsg" Subject hack, sect. 2.1.4 "Re: " ]

comments        =       "Comments:" [FWS] unstructured CRLF

keywords        =       "Keywords:" [CFWS] phrase *("," [CFWS] phrase) CRLF

resent-date     =       "Resent-Date:" [FWS] date-time CRLF

resent-from     =       "Resent-From:" [CFWS] mailbox-list CRLF

resent-sender   =       "Resent-Sender:" [CFWS] mailbox CRLF

resent-to       =       "Resent-To:" [CFWS] address-list CRLF

resent-cc       =       "Resent-Cc:" [CFWS] address-list CRLF

resent-bcc      =       "Resent-Bcc:" [CFWS] [address-list] CRLF

resent-msg-id   =       "Resent-Message-ID:" [CFWS] msg-id CRLF

trace           =       [return] 1*received

return          =       "Return-Path:" [CFWS] path CRLF

path            =       ("<" [CFWS] [addr-spec] ">" [CFWS]) / obs-path

received = "Received:" [CFWS] name-val-list ";" [FWS] date-time CRLF

name-val-list   =       [*(name-val-pair CFWS) name-val-pair]
[N.B. the 2822 specification does not provide for mandatory CFWS at the end of the list, as opposed to RFC 821 (required <SP>) and 2821, which would give:
   [name-val-pair CFWS *(name-val-pair CFWS)]
]

name-val-pair   =       item-name CFWS item-value

item-name       =       ALPHA *(["-"] (ALPHA / DIGIT))

item-value      =       1*angle-addr / addr-spec / atom / domain / msg-id

optional-field  =       field-name ":" [FWS] unstructured CRLF

field-name      =       1*ftext

ftext           =       %d33-57 /               ; Any character except
                       %d59-126                ;  controls, SP, and
                                               ;  ":".

obs-qp          =       "\" (%d0-127)
[N.B. unnecessary]

obs-text        =       %d0-127
[N.B. original 2822 specification was as obs-utext in this file, which permitted multiple characters]

obs-char        =       %d0-9 / %d11 /          ; %d0-127 except CR and
                       %d12 / %d14-127         ;  LF

obs-utext       =       *LF *CR *(obs-char *LF *CR)
[N.B. was obs-text]

obs-phrase      =       word *(word / ("." [CFWS]))

obs-phrase-list =       phrase / (1*([phrase] "," [CFWS]) [phrase])

obs-FWS         =       1*WSP *(CRLF 1*WSP)

obs-date-time = [ day-name [CFWS] "," [CFWS]] obs-date [CFWS] FWS [CFWS] obs-time [CFWS]
[N.B. obs- rule does not provide for adjacent date and time permitted by RFC 822]

obs-date        =       day CFWS month-name CFWS obs-year
[N.B. obs- rule does not permit (e.g.) 1Jan2001 which was permissible under RFC 822]

obs-year        =       2*DIGIT

obs-time        =       obs-time-of-day CFWS (zone / obs-zone)
[N.B. obs- rule does not permit adjacent time and zone, which was permissible under RFC 822]

obs-time-of-day = hour [CFWS] ":" [CFWS] minute [CFWS] ":" [[CFWS] second]

obs-zone        =       "UT" / "GMT" /          ; Universal Time
                                               ; North American UT
                                               ; offsets
                       "EST" / "EDT" /         ; Eastern:  - 5/ - 4
                       "CST" / "CDT" /         ; Central:  - 6/ - 5
                       "MST" / "MDT" /         ; Mountain: - 7/ - 6
                       "PST" / "PDT" /         ; Pacific:  - 8/ - 7
                       %d65-73 /               ; Military zones - "A"
                       %d75-90 /               ; through "I" and "K"
                       %d97-105 /              ; through "Z", both
                       %d107-122               ; upper and lower case

obs-angle-addr  =       "<" [CFWS] [obs-route] addr-spec ">" [CFWS]

obs-route       =       obs-domain-list ":" [CFWS]

obs-domain-list = "@" [CFWS] domain *(1*("," [CFWS]) "@" [CFWS] domain)

obs-local-part  =       word *("." [CFWS] word)

obs-domain      =       atom *("." [CFWS] atom)

obs-mbox-list   =       1*([mailbox] "," [CFWS]) [mailbox]

obs-addr-list   =       1*([address] "," [CFWS]) [address]

obs-fields      =       *(obs-return /
                         obs-received /
                         obs-orig-date /
                         obs-from /
                         obs-sender /
                         obs-reply-to /
                         obs-to /
                         obs-cc /
                         obs-bcc /
                         obs-message-id /
                         obs-in-reply-to /
                         obs-references /
                         obs-subject /
                         obs-comments /
                         obs-keywords /
                         obs-resent-date /
                         obs-resent-from /
                         obs-resent-send /
                         obs-resent-rply /
                         obs-resent-to /
                         obs-resent-cc /
                         obs-resent-bcc /
                         obs-resent-mid /
                         obs-optional)

obs-orig-date   =       "Date" *WSP ":" [CFWS] date-time CRLF

obs-from        =       "From" *WSP ":" [CFWS] mailbox-list CRLF

obs-sender      =       "Sender" *WSP ":" [CFWS] mailbox CRLF

obs-reply-to    =       "Reply-To" *WSP ":" [CFWS] address-list CRLF

obs-to          =       "To" *WSP ":" [CFWS] address-list CRLF

obs-cc          =       "Cc" *WSP ":" [CFWS] address-list CRLF

obs-bcc         =       "Bcc" *WSP ":" [CFWS] [address-list] CRLF

obs-message-id  =       "Message-ID" *WSP ":" [CFWS] msg-id CRLF

obs-in-reply-to = "In-Reply-To" *WSP ":" [CFWS] *(phrase / msg-id) CRLF

obs-references  =       "References" *WSP ":" [CFWS] *(phrase / msg-id) CRLF

obs-msg-id      =       "<" [CFWS] addr-spec ">" [CFWS]

obs-subject = "Subject" *WSP ":" [FWS] [("cmsg" / "Re:") [FWS]] unstructured CRLF
[ RFC 1036 sect. 2.2.6 "cmsg" hack, 2.1.4 "Re:" (w/ or w/o space) ]

obs-comments    =       "Comments" *WSP ":" [FWS] unstructured CRLF

obs-keywords    =       "Keywords" *WSP ":" [CFWS] obs-phrase-list CRLF

obs-resent-from =       "Resent-From" *WSP ":" [CFWS] mailbox-list CRLF

obs-resent-send =       "Resent-Sender" *WSP ":" [CFWS] mailbox CRLF

obs-resent-date =       "Resent-Date" *WSP ":" [CFWS] date-time CRLF

obs-resent-to   =       "Resent-To" *WSP ":" [CFWS] address-list CRLF

obs-resent-cc   =       "Resent-Cc" *WSP ":" [CFWS] address-list CRLF

obs-resent-bcc  =       "Resent-Bcc" *WSP ":" [CFWS] [address-list] CRLF

obs-resent-mid  =       "Resent-Message-ID" *WSP ":" [CFWS] msg-id CRLF

obs-resent-rply =       "Resent-Reply-To" *WSP ":" [CFWS] address-list CRLF

obs-return      =       "Return-Path" *WSP ":" [CFWS] path CRLF

obs-received = "Received" *WSP ":" [CFWS] name-val-list [ ";" [CFWS] obs-date-time ] CRLF
[N.B. RFC 822 required date-time stamp]
[N.B. reference online version of 2822 specification does not permit WSP before colon if date-time stamp is used; RFC 822 permitted (nay, required):
   "Received" *WSP ":" [CFWS] name-val-list ";" [CFWS] obs-date-time CRLF
]

obs-path        =       obs-angle-addr

obs-optional    =       field-name *WSP ":" [FWS] unstructured CRLF

--------------------------------------------------------------------------------
Notes not part of modified grammar:

For LR(1) parser compatibility, lexical tokens are grouped such that trailing
WS, FWS, or CFWS is associated with its preceding lexical token.  Therefore,
no lexical token handled by the higher-level parser grammar rules has any
ambiguity associated with optional WS, FWS, or CFWS. So, where this revised
grammar has:

   obs-mbox-list   =       1*([mailbox] "," [CFWS]) [mailbox]

that is handled by the implementation as:

   obs-mbox-list   =       1*([mailbox] ("," [CFWS])) [mailbox]
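
As a rough illustration of that token grouping (a sketch only, not the actual implementation), the Python fragment below attaches any trailing white space and comments to the preceding token, so that whatever consumes the tokens never sees CFWS on its own. The names skip_cfws and tokens are invented for this example. Quoted-strings and domain-literals are not handled, and CR and LF are simply treated as white space.

```python
import re

# One or more atext characters, per the atext rule in the grammar above.
ATOM_RE = re.compile(r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]+")


def skip_cfws(s, i):
    """Return the index just past any run of white space and
    (possibly nested) comments starting at index i."""
    depth = 0
    while i < len(s):
        c = s[i]
        if depth == 0:
            if c in " \t\r\n":
                i += 1
            elif c == "(":
                depth = 1
                i += 1
            else:
                break
        elif c == "\\" and i + 1 < len(s):
            i += 2              # quoted-pair inside a comment
        else:
            if c == "(":
                depth += 1
            elif c == ")":
                depth -= 1
            i += 1
    return i


def tokens(s):
    """Yield (kind, text) pairs; each text includes its trailing CFWS."""
    i = skip_cfws(s, 0)         # leading CFWS belongs to no token
    while i < len(s):
        m = ATOM_RE.match(s, i)
        if m:
            kind, end = "atom", m.end()
        else:
            kind, end = s[i], i + 1     # a special such as "," or "<"
        nxt = skip_cfws(s, end)
        yield kind, s[i:nxt]
        i = nxt
```

For the input "a , (x) b" this yields an atom "a " (with its trailing space), a comma token ", (x) " that has swallowed the comment, and an atom "b", which corresponds to the ("," [CFWS]) grouping shown above.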


Additional rules such as:

   start           =       (":" [FWS]) / obs-start
   obs-start       =       *WSP ":" [FWS]
   cstart          =       (":" [CFWS]) / obs-cstart
   obs-cstart      =       *WSP ":" [CFWS]
   dstart          =       start / obs-cstart

can be used to reduce the number of rules, e.g.:

   orig-date       =       "Date" dstart date-time CRLF
(eliminating obs-orig-date (also applies to resent-date))
   subject         =       "Subject" start ["cmsg" [FWS]] unstructured CRLF
(eliminating obs-subject (start also applies to comments and optional-field))
   from            =       "From" cstart mailbox-list CRLF
(eliminating obs-from (cstart applies to remaining header fields))
etc., allowing all of the obs- header fields to be eliminated, and obs-fields to
be simplified.
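
The same idea can be sketched outside the grammar. The Python fragment below is only an illustration: it uses one pattern for the start of a field, accepting both the current form and the obs- form with white space before the colon, so no separate obs- variant is needed per field. The names FIELD_START and split_field are invented here, and the optional FWS/CFWS after the colon is simplified to plain WSP (no folding, no comments).

```python
import re

# field-name (ftext: %d33-57 / %d59-126), optional WSP as in the obs- forms,
# the colon, then optional WSP standing in for [FWS] / [CFWS].
FIELD_START = re.compile(r"([!-9;-~]+)[ \t]*:[ \t]*")


def split_field(line):
    """Split one unfolded header line into (field-name, field-body)."""
    m = FIELD_START.match(line)
    if m is None:
        raise ValueError("not a header field: %r" % line)
    return m.group(1), line[m.end():]
```

Both "From: a@b" and "From \t: a@b" come back as ("From", "a@b"), which is the effect the start and cstart rules are meant to have.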


And adding:

   resent          =       "Resent-"

allows:

   resent-from     =       resent from

etc., allowing the resent- fields to be simplified and ensuring that the
definitions remain in sync between base and resent- versions.
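
A parser's field table can follow the same pattern. The Python fragment below is a minimal sketch under the assumption that field bodies are looked up by lower-cased field name; BASE_BODIES, with_resent and body_rule are names made up for the example, and the values are just the rule names from the grammar above. The resent- entries are derived from the base entries rather than written out, so the two cannot drift apart.

```python
# Body rule for each base field, taken from the grammar above.
BASE_BODIES = {
    "date": "date-time",
    "from": "mailbox-list",
    "sender": "mailbox",
    "to": "address-list",
    "cc": "address-list",
    "bcc": "[address-list]",
    "message-id": "msg-id",
}


def with_resent(base):
    """Return a copy of base with a resent- entry added for every field."""
    table = dict(base)
    for name, body in base.items():
        table["resent-" + name] = body
    return table


FIELD_BODIES = with_resent(BASE_BODIES)


def body_rule(field_name):
    """Look up the body rule for a field name, ignoring case."""
    return FIELD_BODIES.get(field_name.lower())
```

With this table, body_rule("Resent-Bcc") and body_rule("Bcc") always give the same answer, and a change to a base entry carries over to its resent- counterpart.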