Rejo Zenger <subs(_at_)sisterray(_dot_)xs4all(_dot_)nl> writes:
Similar discussions have come up on the list before, but there's still one
thing I don't understand. I now have the following in my rc
(copied from previous postings to this list):
dq = '"'
bw = "\\"
ws = "[ ]*"
atom = "[-!#-'*+/-9=?A-Z^-~]+"
word = "($atom|$dq([^$dq\]|$bw.)*$dq)"
local_part = "$word($ws\.$ws$word)*"
domain = "(\[$ws([^][\]|$bw.)*$ws\]|$atom($ws\.$ws$atom)*)"
And:
:0
*$ ! ^Message-Id:$ws<$ws$local_part$ws@@$ws$domain$ws>
{
nl
nl = ${notice+"$NL"}
notice = "$notice${nl}X-NOTE: Did not have a RFC compliant Message-ID field."
}
When using these, I get quite a lot of emails that are matched by this
recipe, which means either that a lot of invalid Message-Ids are in use or
that there's something wrong in my recipe. I tried to look things up and
check it against RFC822, but the one thing I cannot find there is this
whitespace. IMHO, but maybe I'm overlooking something, whitespace is
not allowed in a Message-Id. But, if I understand the above correctly,
this check allows it. Where am I going wrong?
While it is true that the high-level syntax of a Message-Id: header does
not mention comments or whitespace, this is because they both disappear
during the lexical analysis. To quote rfc822, section 3.1.2:
Note: Any field which has a field-body that is defined as
other than simply <text> is to be treated as a structured
field.
Then in section 3.1.4:
To aid in the creation and reading of structured fields, the
free insertion of linear-white-space (which permits folding
by inclusion of CRLFs) is allowed between lexical tokens.
Then follows a precise listing of the lexical tokens of a structured
header field.
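The reason your recipe matches too often is that the at-sign is
doubled in the condition, so the regexp only matches a Message-Id:
containing a literal "@@" and the negated condition succeeds for nearly
everything else: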
*$ ! ^Message-Id:$ws<$ws$local_part$ws@@$ws$domain$ws>
                                      ^^
Remove one of those.
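
As a rough check outside procmail, the following Python sketch rebuilds the pattern from the rc variables and compares the "@@" and "@" versions. It is an approximation under stated assumptions: Python's `re` module is not procmail's regexp engine, the `ws` class is assumed to contain a space and a tab, and the sample headers are made up for illustration.

```python
import re

# Python translations of the rc-file variables. This is an approximation:
# Python's `re` module is not procmail's regexp engine.
WS = r"[ \t]*"  # assumption: the original class held a space and a tab
ATOM = r"[-!#-'*+/-9=?A-Z^-~]+"
QUOTED = r'"(?:[^"\\]|\\.)*"'
WORD = "(?:" + ATOM + "|" + QUOTED + ")"
LOCAL_PART = WORD + "(?:" + WS + r"\." + WS + WORD + ")*"
DOMAIN_LITERAL = r"\[" + WS + r"(?:[^\]\[\\]|\\.)*" + WS + r"\]"
DOTTED_ATOMS = ATOM + "(?:" + WS + r"\." + WS + ATOM + ")*"
DOMAIN = "(?:" + DOMAIN_LITERAL + "|" + DOTTED_ATOMS + ")"


def message_id_pattern(at_sign):
    """Build the header pattern with `at_sign` between local part and domain."""
    return re.compile(
        "^Message-Id:" + WS + "<" + WS + LOCAL_PART + WS
        + at_sign + WS + DOMAIN + WS + ">",
        re.IGNORECASE,  # procmail conditions ignore case by default
    )


doubled = message_id_pattern("@@")  # the condition as posted
single = message_id_pattern("@")    # the condition with one at-sign removed

# Hypothetical sample headers, not taken from real mail.
samples = [
    "Message-Id: <20020101.1234@example.com>",    # ordinary Message-Id
    "Message-Id: < foo . bar @ example . com >",  # whitespace between tokens
    "Message-Id: <fo o@example.com>",             # whitespace inside an atom
]

for header in samples:
    print(header, bool(doubled.search(header)), bool(single.search(header)))
```

With the single at-sign, whitespace between lexical tokens is accepted while whitespace inside an atom is not, which is the distinction section 3.1.4 draws.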
Finally, I'll note that rfc822 actually allows comments in Message-Id:
headers (indeed, comments are one of the lexical tokens listed in section
3.1.4). While it is impossible to match arbitrarily nested parens with
a regular expression, it is simple to match one level of parens, and
given that there's a Banyan Vines MTA that includes a comment in the
local part of the Message-Id: header, I would recommend changing the
'ws' definition to the following:
ws="[ ]*(\([^()]*\)[ ]*)?"
(Yes, that _could_ be
ws="[ ]*(\([^()]*\)[ ]*)*"
but I have yet to see a Message-Id: header with two comments in a row,
and I don't feel like giving that much slack to loser MTA/MUA writers.)
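
The three 'ws' variants can be compared with a small Python sketch. The same caveats apply: this is a translation into Python's `re` syntax rather than procmail itself, the tab in the whitespace class is an assumption, and the headers (including the one with a comment in the local part) are invented examples, not actual Banyan Vines output.

```python
import re

# Python translations of the rc-file variables (approximate; Python's `re`
# module is not procmail's regexp engine).
ATOM = r"[-!#-'*+/-9=?A-Z^-~]+"
QUOTED = r'"(?:[^"\\]|\\.)*"'
WORD = "(?:" + ATOM + "|" + QUOTED + ")"

# Assumption: the whitespace class holds a space and a tab.
WS_PLAIN = r"[ \t]*"
WS_ONE_COMMENT = r"[ \t]*(?:\([^()]*\)[ \t]*)?"    # the recommended definition
WS_MANY_COMMENTS = r"[ \t]*(?:\([^()]*\)[ \t]*)*"  # the more lenient variant


def message_id_pattern(ws):
    """Build the Message-Id: pattern around a given definition of `ws`."""
    local_part = WORD + "(?:" + ws + r"\." + ws + WORD + ")*"
    domain_literal = r"\[" + ws + r"(?:[^\]\[\\]|\\.)*" + ws + r"\]"
    dotted_atoms = ATOM + "(?:" + ws + r"\." + ws + ATOM + ")*"
    domain = "(?:" + domain_literal + "|" + dotted_atoms + ")"
    return re.compile(
        "^Message-Id:" + ws + "<" + ws + local_part + ws
        + "@" + ws + domain + ws + ">",
        re.IGNORECASE,  # procmail conditions ignore case by default
    )


plain = message_id_pattern(WS_PLAIN)
one = message_id_pattern(WS_ONE_COMMENT)
many = message_id_pattern(WS_MANY_COMMENTS)

# Hypothetical headers for illustration only.
one_comment = "Message-Id: <1234(generated by gateway)@mail.example.com>"
two_comments = "Message-Id: <1234(a)(b)@mail.example.com>"
nested = "Message-Id: <1234(a(b))@mail.example.com>"

for header in (one_comment, two_comments, nested):
    print(header,
          bool(plain.search(header)),
          bool(one.search(header)),
          bool(many.search(header)))
```

Neither comment-aware variant accepts the nested comment, since `[^()]*` only covers a single level of parens.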
Philip Guenther