Okay, I'll weigh in on this topic.
Christopher Lindsey <lindsey(_at_)ncsa(_dot_)uiuc(_dot_)edu> writes:
...
* Disallow spaces and tabs after @ and require at least one char
I *think* the RFC says that spaces and tabs are allowed, if
the domain is enclosed in quotes (which strikes me as just plain
weird). But it's also 2:30 in the morning and I'm reading RFCs,
which strikes me as equally strange.
Actually, you can always place linear-white-space around the @. To quote
rfc822:
3.1.4. STRUCTURED FIELD BODIES
To aid in the creation and reading of structured fields, the
free insertion of linear-white-space (which permits folding
by inclusion of CRLFs) is allowed between lexical tokens.
* Permit whitespace after closing broket
I'm pretty sure that this isn't allowed. The RFC implies everything
that's allowed up to the closing bracket, but doesn't explicitly say
"No, you have to stop here." So maybe you're right. I'm
going to leave mine without this allowance and see what happens...
Since the terminating CRLF is a "lexical token", LWSP is allowed before
it.
Any other problems?
Can also have escaped carriage returns.
That's not all: comments may be inserted between any two lexical
tokens. I don't recall ever seeing a comment in a message-id line, but
it is technically allowed. I would suggest ignoring the possibility, as
the syntax for commentless message-ids can be matched with a regexp.
...
I'm ashamed to say that I've been testing the filters on all incoming
mail at our site (with my boss's permission, of course), so I get
a fairly wide range of messages as my testbed (about 20,000 messages
daily).
Hmm, that's tempting. I'll have to check with my boss about doing that
here...
To comment on two message-ids which have been discussed previously in
this thread:
Message-ID: <M10250103.005.z2j35.1.980317164507Z
.CC-MAIL*/O=HQ/PRMD=USDOE/ADMD=ATTMAIL/C=US
/@MHS>
Illegal: the second line fold is in an illegal location (the slash right
before the at-sign should be at the end of the previous line)
Message-Id: <"OP-MIME expo400:439*""
<allanm(_at_)op(_dot_)x400(_dot_)icl(_dot_)co(_dot_)uk>"@MHS>
Illegal: the two quoted strings should either be merged into one or
separated with a period
That said, let's do a quick work-up of a 'more complete' regexp to
match message-ids. I'll cite syntax lines from rfc822 with regexps
that should match them. For ease of presentation, I'm going to work
from the bottom up. Note: any brackets that only contain whitespace
should really contain a space and a tab.
atom = 1*<any CHAR except specials, SPACE and CTLs>
[-!#-'*+/-9=?A-Z^-~]+
The specials are: []()<>@,;:\". The above character
class is done the way it is, instead of as an inversion
of the specials, so that it can exclude NUL. Procmail
doesn't let you put a NUL in a regexp.
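
As a sanity check on that character class, here is a minimal sketch in Python. It uses Python's re module rather than procmail's own engine, on the assumption that the two treat a plain bracket expression like this one the same way. It lists every 7-bit character the class accepts and compares the result with printable ASCII minus the specials. The names ATOM_CHAR, SPECIALS and atom_chars are made up for this illustration.

```python
import re

# The atom character class from the message, unchanged.
ATOM_CHAR = re.compile(r"[-!#-'*+/-9=?A-Z^-~]")

# The rfc822 specials, including the period.
SPECIALS = set('()<>@,;:\\".[]')


def atom_chars():
    """Return every 7-bit character that the class accepts."""
    return {chr(c) for c in range(128) if ATOM_CHAR.fullmatch(chr(c))}


# Printable ASCII (no SPACE, no CTLs, no DEL) minus the specials.
expected = {chr(c) for c in range(33, 127)} - SPECIALS

print(atom_chars() == expected)
```

If the class is right, the comparison holds, and NUL is excluded without ever having to appear in the pattern.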
quoted-pair = "\" CHAR ; may quote any char
\\.
It is not clear whether it is legal to end a line with
a backslash. The IETF working-group that's working on
the revision of rfc822 and 821 had disallowed it, last
time I checked. Note that the above does allow a backslash
right before a fold, as the fold will look like a space to
procmail's regexp engine.
qtext = <any CHAR excepting <">, ; => may be folded
"\" & CR, and including
linear-white-space>
[^"\]
quoted-string = <"> *(qtext/quoted-pair) <">; Regular qtext or
; quoted chars.
"([^"\]|\\.)*"
word = atom / quoted-string
("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+)
local-part = word *("." word) ; uninterpreted
; case-preserved
("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+)([ ]*\.[ ]*("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+))*
Note that linear-white-space is allowed between tokens,
so we must let it appear around the period.
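
To try the local-part expression outside procmail, something like the following Python sketch may help. It is only an approximation: the backslash inside a bracket expression has to be doubled for Python's re, the whitespace brackets are written as [ \t], and the input is assumed to be already unfolded onto one line. The names WS, ATOM, QUOTED_STRING, WORD, LOCAL_PART and is_local_part are invented for the example.

```python
import re

# Building blocks, translated from the procmail expressions above.
WS = r"[ \t]*"                          # the "space and tab" brackets
ATOM = r"[-!#-'*+/-9=?A-Z^-~]+"
QUOTED_STRING = r'"(?:[^"\\]|\\.)*"'    # qtext or quoted-pair inside quotes
WORD = rf"(?:{QUOTED_STRING}|{ATOM})"
LOCAL_PART = re.compile(rf"{WORD}(?:{WS}\.{WS}{WORD})*")


def is_local_part(text):
    """True if the whole string matches word *("." word)."""
    return LOCAL_PART.fullmatch(text) is not None


print(is_local_part('abc . "x y"'))   # whitespace around the period
print(is_local_part('"a""b"'))        # two adjacent quoted strings
```

The second call is the same shape as the second message-id discussed earlier: two quoted strings with neither a merge nor a period between them.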
It is generally accepted that the syntax given for a "domain" is more
liberal than was intended. In particular, it allows the following:
foo.[blagh skdjfhskjfhsdkjfskjf].baz
It is agreed that the intention was to only allow "domain-literals"
(bracketed expressions) as entire "domains", not as "sub-domains".
This will be corrected in the revision, so I'm going to follow that
syntax which runs something equivalent to:
dtext = <any CHAR excluding "[", ; => may be folded
"]", "\" & CR, & including
linear-white-space>
[^][\]
sub-domain = atom ; symbolic reference
[-!#-'*+/-9=?A-Z^-~]+
domain-literal = "[" *(dtext / quoted-pair) "]"
\[[ ]*([^][\]|\\.)*[ ]*\]
No, the trailing close bracket doesn't really need
to be escaped, but I'm going to anyway.
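
The bracket expression for dtext is the hardest piece to read, so here is a small Python sketch of the domain-literal pattern on its own. Python's re needs the brackets and the backslash escaped inside the class, which is a difference from the procmail spelling above. DOMAIN_LITERAL and is_domain_literal are hypothetical names used only here.

```python
import re

# "[" *(dtext / quoted-pair) "]", with optional space/tab just inside
# the brackets as in the procmail version.
DOMAIN_LITERAL = re.compile(r"\[[ \t]*(?:[^\]\[\\]|\\.)*[ \t]*\]")


def is_domain_literal(text):
    """True if the whole string is a single bracketed domain-literal."""
    return DOMAIN_LITERAL.fullmatch(text) is not None


print(is_domain_literal("[10.0.0.1]"))
print(is_domain_literal("[a[b]"))     # unescaped "[" inside the literal
print(is_domain_literal("[a\\]b]"))   # "]" escaped as a quoted-pair
```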
domain = domain-literal / ( sub-domain *("." sub-domain) )
(\[[ ]*([^][\]|\\.)*[ ]*\]|[-!#-'*+/-9=?A-Z^-~]+([ ]*\.[ ]*[-!#-'*+/-9=?A-Z^-~]+)*)
addr-spec = local-part "@" domain ; global address
("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+)([ ]*\.[ ]*("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+))*[ ]*@[ ]*(\[[ ]*([^][\]|\\.)*[ ]*\]|[-!#-'*+/-9=?A-Z^-~]+([ ]*\.[ ]*[-!#-'*+/-9=?A-Z^-~]+)*)
msg-id = "<" addr-spec ">" ; Unique message id
<[ ]*("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+)([ ]*\.[ ]*("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+))*[ ]*@[ ]*(\[[ ]*([^][\]|\\.)*[ ]*\]|[-!#-'*+/-9=?A-Z^-~]+([ ]*\.[ ]*[-!#-'*+/-9=?A-Z^-~]+)*)[ ]*>
So, that line is it for matching commentless message-ids. To make it into a
condition and break it up into semi-readable chunks:
:0
* ! ^Message-Id:[ ]*<[ ]*("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+)\
([ ]*\.[ ]*("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+))*\
[ ]*@[ ]*\
(\[[ ]*([^][\]|\\.)*[ ]*\]|\
[-!#-'*+/-9=?A-Z^-~]+([ ]*\.[ ]*[-!#-'*+/-9=?A-Z^-~]+)*)\
[ ]*>
Hmm, maybe I'll start logging ids that match that condition.
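
For anyone who wants to experiment before putting this in a recipe, the same expression can be assembled piece by piece in Python. This is a rough sketch, not a drop-in replacement. It assumes the header value has already been unfolded onto one line (each fold turned into a space), it uses Python's re instead of procmail's engine, and every name in it (WS, ATOM, WORD, MSG_ID, is_commentless_msg_id and so on) is invented for the example.

```python
import re

WS = r"[ \t]*"
ATOM = r"[-!#-'*+/-9=?A-Z^-~]+"
QUOTED_STRING = r'"(?:[^"\\]|\\.)*"'
WORD = rf"(?:{QUOTED_STRING}|{ATOM})"
LOCAL_PART = rf"{WORD}(?:{WS}\.{WS}{WORD})*"
DOMAIN_LITERAL = rf"\[{WS}(?:[^\]\[\\]|\\.)*{WS}\]"
DOMAIN = rf"(?:{DOMAIN_LITERAL}|{ATOM}(?:{WS}\.{WS}{ATOM})*)"
MSG_ID = re.compile(rf"<{WS}{LOCAL_PART}{WS}@{WS}{DOMAIN}{WS}>")


def is_commentless_msg_id(value):
    """True if an unfolded Message-ID value matches the commentless syntax."""
    return MSG_ID.fullmatch(value.strip(" \t")) is not None


# The first id from this thread, unfolded as it was sent: the second fold
# splits the last atom of the local-part in two, so it is rejected.
as_sent = ("<M10250103.005.z2j35.1.980317164507Z"
           " .CC-MAIL*/O=HQ/PRMD=USDOE/ADMD=ATTMAIL/C=US /@MHS>")
# The same id with the slash moved to the end of the previous line.
repaired = ("<M10250103.005.z2j35.1.980317164507Z"
            " .CC-MAIL*/O=HQ/PRMD=USDOE/ADMD=ATTMAIL/C=US/ @MHS>")

print(is_commentless_msg_id(as_sent))
print(is_commentless_msg_id(repaired))
```

Building the pattern from named pieces keeps each rfc822 production visible, at the cost of being a little longer than the single procmail condition.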
Philip Guenther