Re: Message-ID syntax

Rejo Zenger <subs(_at_)sisterray(_dot_)xs4all(_dot_)nl> writes:

Similar discussions have come up before on the list, but there's still one
thing i don't understand. I now have the following in my rc
(copied from previous postings to this list):

 dq = '"'
 bw = "\\"
 ws         = "[         ]*"
 atom       = "[-!#-'*+/-9=?A-Z^-~]+"
 word       = "($atom|$dq([^$dq\]|$bw.)*$dq)"
 local_part = "$word($ws\.$ws$word)*"
 domain     = "(\[$ws([^][\]|$bw.)*$ws\]|$atom($ws\.$ws$atom)*)"

And:

 :0
 *$ ! ^Message-Id:$ws<$ws$local_part$ws@@$ws$domain$ws>
 {
         nl
         nl      = ${notice+"$NL"}
         notice  = "$notice${nl}X-NOTE: Did not have a RFC compliant Message-ID field."
 }

When using these, i get quite a lot of emails matched by this
recipe, so that means either a lot of invalid Message-Id's are in use, or
there's something wrong in my recipe. I tried to look things up and
check it against RFC822, but one thing i cannot find there is this
whitespace. IMHO, but maybe i'm overlooking something, whitespace is
not allowed in a Message-Id. But, if i understand the above correctly,
this check allows it. Where am i going wrong?


While it is true that the high-level syntax of a Message-Id: header does
not mention comments or whitespace, this is because they both disappear
during the lexical analysis.  To quote rfc822, section 3.1.2:

        Note:  Any field which has a field-body  that  is  defined  as
               other  than  simply <text> is to be treated as a struc-
               tured field.

Then in section 3.1.4:

        To aid in the creation and reading of structured  fields,  the
        free  insertion   of linear-white-space (which permits folding
        by inclusion of CRLFs)  is  allowed  between  lexical  tokens.

Then follows a precise listing of the lexical tokens of a structured
header field.

The reason your condition matches too often is that the at-sign is
doubled in it:
        *$ ! ^Message-Id:$ws<$ws$local_part$ws@@$ws$domain$ws>
                                              ^^
                                              ^^

Remove one of those.
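
With that change the condition line would read as follows (a sketch only;
the variable definitions and the action block in your rc are assumed to
stay as they are):

        *$ ! ^Message-Id:$ws<$ws$local_part$ws@$ws$domain$ws>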


Finally, I'll note that rfc822 actually allows comments in Message-Id:
headers (indeed, comments are one of the lexical tokens listed in section
3.1.4).  While it is impossible to match arbitrarily nested parens with
a regular expression, it is simple to match one level of parens, and
given that there's a Banyan Vines MTA that includes a comment in the
local part of the Message-Id: header, I would recommend changing the
'ws' definition to the following:
        ws="[   ]*(\([^()]*\)[  ]*)?"

(Yes, that _could_ be
        ws="[   ]*(\([^()]*\)[  ]*)*"
but I have yet to see a Message-Id: header with two comments in a row,
and I don't feel like giving that much slack to loser MTA/MUA writers.)
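
To illustrate what the single-comment version accepts, here is a sketch
using made-up Message-Id: headers.  The addresses are hypothetical, the
brackets are assumed to hold one space and one tab, and the rest of the
condition is assumed to use a single at-sign:

        ws="[   ]*(\([^()]*\)[  ]*)?"
        # pattern matches:  Message-Id: <1234@mail.example.com>
        # pattern matches:  Message-Id: <1234 (a comment) @mail.example.com>
        # no match:         Message-Id: <1234(one)(two)@mail.example.com>
        # no match:         Message-Id: <1234(nested (comment))@mail.example.com>

Since the condition is negated, only the "no match" headers would pick up
the X-NOTE.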


Philip Guenther