Re: bad message id's


Okay, I'll weigh in on this topic.

Christopher Lindsey <lindsey(_at_)ncsa(_dot_)uiuc(_dot_)edu> writes:
...

  * Disallow spaces and tabs after @ and require at least one char


I *think* the RFC says that spaces and tabs are allowed, if
the domain is enclosed in quotes (which strikes me as just plain
weird).  But it's also 2:30 in the morning and I'm reading RFCs,
which strikes me as equally strange.


Actually, you can always place linear-white-space around the @.  To quote
rfc822:

     3.1.4.  STRUCTURED FIELD BODIES
 
        To aid in the creation and reading of structured  fields,  the
        free  insertion   of linear-white-space (which permits folding
        by inclusion of CRLFs)  is  allowed  between  lexical  tokens.

  * Permit whitespace after closing broket


I'm pretty sure that this isn't allowed.  The RFC implies everything
that's allowed up to the closing bracket, but doesn't explicitly say
"No, you have to stop here."  So maybe you're right.  I'm 
going to leave mine without this allowance and see what happens...


Since the terminating CRLF is a "lexical token", LWSP is allowed before
it.

Any other problems?


Can also have escaped carriage returns.


That's not all: comments may be inserted between any two lexical
tokens.  I don't recall ever seeing a comment in a message-id line, but
it is technically allowed.  I would suggest ignoring the possibility, as
the syntax for commentless message-ids can be matched with a regexp.


...

I'm ashamed to say that I've been testing the filters on all incoming
mail at our site (with my boss's permission, of course), so I get
a fairly wide range of messages as my testbed (about 20,000 messages
daily).


Hmm, that's tempting.  I'll have to check with my boss about doing that
here...

To comment on two message-ids which have been discussed previously in
this thread:

        Message-ID:  <M10250103.005.z2j35.1.980317164507Z
         .CC-MAIL*/O=HQ/PRMD=USDOE/ADMD=ATTMAIL/C=US
                      /@MHS>

Illegal: the second line fold is in an illegal location (the slash right
before the at-sign should be at the end of the previous line).


        Message-Id: <"OP-MIME expo400:439*"" 
<allanm(_at_)op(_dot_)x400(_dot_)icl(_dot_)co(_dot_)uk>"@MHS>

Illegal: the two quoted strings should either be merged into one or
separated with a period



That said, let's do a quick work-up of a 'more complete' regexp to
match message-ids.  I'll cite syntax lines from rfc822 with regexps
that should match them.  For ease of presentation, I'm going to work
from the bottom up.  Note:  any brackets that only contain whitespace
should really contain a space and a tab.

     atom        =  1*<any CHAR except specials, SPACE and CTLs>
[-!#-'*+/-9=?A-Z^-~]+

                The specials are: []()<>@,;:\".  The above character
                class is done the way it is, instead of as an inversion
                of the specials, so that it can exclude NUL.  Procmail
                doesn't let you put a NUL in a regexp.
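For anyone who wants to double-check the atom class outside procmail, here
is a minimal Python sketch.  It assumes Python's re module reads this
particular class the same way procmail does (it contains nothing
engine-specific, as far as I can tell).  The names ATOM, SPECIALS and
is_atom are mine, not from the RFC or from procmail.

```python
import re

# The atom character class from the message, copied unchanged.
ATOM = re.compile(r"[-!#-'*+/-9=?A-Z^-~]+")

# The rfc822 specials: ( ) < > @ , ; : \ " . [ ]
SPECIALS = '()<>@,;:\\".[]'


def is_atom(text: str) -> bool:
    """Return True if the whole string is a single atom."""
    return ATOM.fullmatch(text) is not None


# Split the printable, non-space ASCII characters (33..126) by whether
# the class accepts them.
printable = [chr(code) for code in range(33, 127)]
accepted = [ch for ch in printable if is_atom(ch)]
rejected = [ch for ch in printable if not is_atom(ch)]

print("rejected:", "".join(rejected))
```

The rejected characters should be exactly the specials, which is the
property the hand-built class is meant to have.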

     quoted-pair =  "\" CHAR                     ; may quote any char
\\.
                It is not clear whether it is legal to end a line with
                a backslash.  The IETF working-group that's working on
                the revision of rfc822 and 821 had disallowed it, last
                time I checked.  Note that the above does allow a backslash
                right before a fold, as the fold will look like a space
                to procmail's regexp engine.

     qtext       =  <any CHAR excepting <">,     ; => may be folded
                     "\" & CR, and including
                     linear-white-space>
[^"\]

     quoted-string = <"> *(qtext/quoted-pair) <">; Regular qtext or
                                                 ;   quoted chars.
"([^"\]|\\.)*"

     word        =  atom / quoted-string
("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+)
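A quick way to convince yourself of the quoted-string piece is to try it
in Python.  This is only a sketch: Python wants the backslash doubled
inside a character class, so the pattern is a translation of the procmail
one, not a copy.  The name is_quoted_string is made up, and the last test
string is modeled on the second message-id discussed above, with a
placeholder address.

```python
import re

# Translation of "([^"\]|\\.)*" into Python syntax: the backslash
# inside the bracket expression has to be escaped here.
QUOTED_STRING = re.compile(r'"([^"\\]|\\.)*"')


def is_quoted_string(text: str) -> bool:
    """Return True if the whole string is one quoted-string."""
    return QUOTED_STRING.fullmatch(text) is not None


samples = [
    '"OP-MIME expo400:439*"',   # plain qtext
    r'"a\"b"',                  # quoted-pair hiding a double quote
    '"a"b"',                    # bare double quote in the middle
    '"OP-MIME expo400:439*"" <user@example.org>"',  # two strings, no period
]
for sample in samples:
    print(is_quoted_string(sample), sample)
```

The last sample is rejected because it is two quoted strings back to
back, which a single quoted-string cannot absorb.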

     local-part  =  word *("." word)             ; uninterpreted
                                                 ; case-preserved
("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+)([        ]*\.[        ]*("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+))*
                Note that linear-white-space is allowed between tokens,
                so we must let it appear around the period.


It is generally accepted that the syntax given for a "domain" is more
liberal than was intended.  In particular, it allows the following:

        foo.[blagh skdjfhskjfhsdkjfskjf].baz

It is agreed that the intention was to allow "domain-literals"
(bracketed expressions) only as entire "domains", not as "sub-domains".
This will be corrected in the revision, so I'm going to follow that
syntax, which runs something like:

     dtext       =  <any CHAR excluding "[",     ; => may be folded
                     "]", "\" & CR, & including
                     linear-white-space>
[^][\]

     sub-domain  = atom                         ; symbolic reference
[-!#-'*+/-9=?A-Z^-~]+

     domain-literal =  "[" *(dtext / quoted-pair) "]"
\[[     ]*([^][\]|\\.)*[        ]*\]
                    No, the trailing close bracket doesn't really need
                    to be escaped, but I'm going to anyway.

     domain      = domain-literal / ( sub-domain *("." sub-domain) )
(\[[    ]*([^][\]|\\.)*[        ]*\]|[-!#-'*+/-9=?A-Z^-~]+([    ]*\.[    ]*[-!#-'*+/-9=?A-Z^-~]+)*)
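To see the effect of the tightened "domain" syntax, here is a rough
Python sketch of the same pattern.  Assumptions: the whitespace-only
brackets are written as [ \t], the dtext class is re-escaped for Python,
and the names WS, ATOM, DOMAIN_LITERAL, DOMAIN and is_domain are mine.

```python
import re

WS = r"[ \t]*"                        # the "space and tab" brackets
ATOM = r"[-!#-'*+/-9=?A-Z^-~]+"       # sub-domain = atom
# "[" *(dtext / quoted-pair) "]", with dtext re-escaped for Python.
DOMAIN_LITERAL = r"\[" + WS + r"([^\]\[\\]|\\.)*" + WS + r"\]"
# Either one domain-literal, or atoms separated by periods.
DOMAIN = re.compile(
    "(" + DOMAIN_LITERAL + "|" + ATOM + "(" + WS + r"\." + WS + ATOM + ")*)"
)


def is_domain(text: str) -> bool:
    """Return True if the whole string matches the tightened domain."""
    return DOMAIN.fullmatch(text) is not None


for candidate in ["MHS", "[10.0.0.1]", "foo . bar",
                  "foo.[blagh skdjfhskjfhsdkjfskjf].baz"]:
    print(is_domain(candidate), candidate)
```

The last candidate is the over-liberal case mentioned above; with
domain-literals allowed only as whole domains, it no longer matches.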


     addr-spec   =  local-part "@" domain        ; global address

("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+)([        ]*\.[        ]*("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+))*[     ]*@[       ]*(\[[    ]*([^][\]|\\.)*[        ]*\]|[-!#-'*+/-9=?A-Z^-~]+([    ]*\.[    ]*[-!#-'*+/-9=?A-Z^-~]+)*)


     msg-id      =  "<" addr-spec ">"            ; Unique message id
<[      ]*("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+)([      ]*\.[        ]*("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+))*[     ]*@[       ]*(\[[    ]*([^][\]|\\.)*[        ]*\]|[-!#-'*+/-9=?A-Z^-~]+([    ]*\.[    ]*[-!#-'*+/-9=?A-Z^-~]+)*)[     ]*>



So, that line is it for matching commentless message-ids.  To make it into a
condition and break it up into semi-readable chunks:

:0
* ! ^Message-Id:[       ]*<[    ]*("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+)\
        ([      ]*\.[   ]*("([^"\]|\\.)*"|[-!#-'*+/-9=?A-Z^-~]+))*\
        [       ]*@[       ]*\
        (\[[    ]*([^][\]|\\.)*[        ]*\]|\
         [-!#-'*+/-9=?A-Z^-~]+([        ]*\.[   ]*[-!#-'*+/-9=?A-Z^-~]+)*)\
        [       ]*>
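If you want to try the pattern on saved headers before trusting it in a
recipe, something like the following Python sketch may help.  It rebuilds
the regexp from the same pieces, bottom up.  Assumptions: headers have
already been unfolded (each fold replaced by its whitespace), matching is
case-insensitive as it is by default in procmail, and the character
classes are re-escaped for Python.  All the names (WS, ATOM, WORD,
looks_valid, and so on) are made up for illustration, and the sample
addresses are placeholders except for the first message-id from this
thread.

```python
import re

WS = r"[ \t]*"
ATOM = r"[-!#-'*+/-9=?A-Z^-~]+"
QUOTED_STRING = r'"([^"\\]|\\.)*"'
WORD = "(" + QUOTED_STRING + "|" + ATOM + ")"
LOCAL_PART = WORD + "(" + WS + r"\." + WS + WORD + ")*"
DOMAIN_LITERAL = r"\[" + WS + r"([^\]\[\\]|\\.)*" + WS + r"\]"
DOMAIN = "(" + DOMAIN_LITERAL + "|" + ATOM + "(" + WS + r"\." + WS + ATOM + ")*)"
MSG_ID = "<" + WS + LOCAL_PART + WS + "@" + WS + DOMAIN + WS + ">"

# Like the procmail condition, this is anchored at the start of the
# header only; whatever follows the closing ">" is not examined.
HEADER = re.compile(r"^Message-Id:" + WS + MSG_ID, re.IGNORECASE)


def looks_valid(header_line: str) -> bool:
    """Return True if an unfolded Message-Id header matches the pattern."""
    return HEADER.match(header_line) is not None


headers = [
    "Message-ID: <199803171645.KAA12345@example.com>",
    "Message-Id: < foo . bar @ [10.0.0.1] >",
    "Message-Id: <foo@>",
    # First id from this thread, unfolded by hand.
    "Message-ID:  <M10250103.005.z2j35.1.980317164507Z"
    " .CC-MAIL*/O=HQ/PRMD=USDOE/ADMD=ATTMAIL/C=US /@MHS>",
]
for header in headers:
    print(looks_valid(header), header)
```

Note that the recipe above starts with "* !", so it fires on messages
whose Message-Id does not match, i.e. the ones for which looks_valid
would return False.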

Hmm, maybe I'll start logging ids that match that condition.

Philip Guenther