Re: Why does this trap invalid Message-ID's?

On Tue, 25 Aug 1998 10:56:46 +0100, Lars Hecking
<lhecking(_at_)nmrc(_dot_)ucc(_dot_)ie> wrote:

Felix Tilley writes:

Can someone explain (either via the list, or via email reply) why this
recipe traps invalid message ID's?  This recipe is rather bewildering to
me.  I can't tell from the log file if it really deletes spam.  But I have
deleted only two genuine email messages in two years, so it must be benign.

 The origin of that regexp is probably heuristic. It basically traps
 headers of the format
 Message-Id:somewhitespace<somestring(_at_)someotherstring>


"Traps" is perhaps misleading; it will pass through those, and block
anything not matching the above.

# This will allegedly trap invalid message ID's
:0
* !^Message-Id:[\t ]+<("[^"]+"|[^ <>@]+)@[^<>]*>$

 one or more space or tab
 plus angle bracket
 plus either
         double quotes followed by a string of at least one char, not
         containing a double quote, plus double quote
      or
         a string of at least one char not containing space, angle brackets,
         or @
 plus @
 plus a string of zero or more chars not containing angle brackets
 plus closing angle bracket
 plus newline
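
 To see the breakdown above in action, here is a minimal sketch that
 transcribes the condition into Python's re module and runs it over a
 few made-up headers. It is not how procmail itself evaluates the
 recipe: it assumes that re.MULTILINE and re.IGNORECASE roughly stand
 in for procmail's line anchors and its default case-insensitive
 matching. The names MSGID_OK and recipe_fires, and all sample
 headers, are hypothetical.

```python
import re

# Python transcription of the condition, without the leading "!".
# Assumption: MULTILINE/IGNORECASE approximate procmail's behaviour.
MSGID_OK = re.compile(
    r'^Message-Id:[\t ]+<("[^"]+"|[^ <>@]+)@[^<>]*>$',
    re.IGNORECASE | re.MULTILINE,
)


def recipe_fires(headers: str) -> bool:
    """True when no header line matches, i.e. the "!" condition holds."""
    return MSGID_OK.search(headers) is None


# Made-up sample headers, for illustration only.
samples = [
    'Message-Id: <199808250956.KAA01234@mail.example.com>',
    'Message-Id: <"quoted local"@example.com>',
    'Message-Id: <>',                      # empty brackets
    'Message-Id: 12345@example.com',       # no angle brackets
    'Subject: no message id at all',       # header missing entirely
    'Message-Id: <no-at-sign>',            # no @ between the brackets
    'Message-Id: <foo@>',                  # empty domain part
    'Message-Id: <a@b@c>',                 # more than one @
    'Message-Id:<a@b>',                    # no whitespace after the colon
    'Message-Id: <a@b> ',                  # whitespace after the bracket
]

for header in samples:
    verdict = "trapped" if recipe_fires(header) else "passed"
    print(f"{verdict:8} {header!r}")
```

 In this transcription the empty-domain and double-@ samples pass,
 while the samples with no whitespace after the colon or with
 whitespace after the closing bracket are trapped.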


As one of the more or less likely originators of the above recipe (I
see some things I don't offhand recognize as necessarily being mine,
and of course I've seen other people come up with something similar
independently of me), perhaps I can explain what I wanted to trap with
mine. It was born out of the following observations:

  * Lots of spam fails (or, at one point, failed) to include the angle
    brackets around the Message-Id
  * Lots of spam would have Message-Id: <> i.e. empty brackets
  * Lots of spam didn't come with a Message-Id at all

All of these can conveniently be covered with a single condition.
Later I amended it with the following observations:

  * Lots of spam fails to include a @ between the angle brackets.
    (Some legit software seems to do this as well but I really don't care.)
  * Lots of spam, and some generally misconfigured software, leaves
    the part after the @ empty. (This is not covered by the above recipe.)

Finally, on this very list, somebody remarked that whitespace is
allowed in a Message-Id if properly quoted. The above recipe only
makes a half-hearted stab at fixing that. (Whitespace in the domain
part should technically be allowed, although DNS of course doesn't
permit any whitespace in real domain names.)

Some silly limitations I would probably attempt to fix in the above
regular expression:

  * Whitespace after Message-Id: is optional, but the regex requires it
  * Whitespace after the closing bracket should be allowed.
    (Somebody actually had a problem with this.)
  * Doesn't check against more than one @
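
As a rough illustration of what addressing those three points might
look like, here is a sketch of a relaxed pattern, again transcribed
into Python's re module rather than tested as a procmail recipe. The
names RELAXED_OK and msgid_ok are hypothetical, and the pattern is only
my guess at a fix: it makes the whitespace after the colon optional,
allows trailing spaces or tabs, and excludes @ from the domain part so
that an unquoted Message-Id can hold only one @. It still accepts an
empty domain part.

```python
import re

# Assumption: MULTILINE/IGNORECASE approximate procmail's behaviour.
RELAXED_OK = re.compile(
    r'^Message-Id:[\t ]*'          # whitespace after the colon is optional
    r'<("[^"]+"|[^ <>@]+)'         # quoted or unquoted local part
    r'@[^<>@]*>'                   # domain part may not contain another @
    r'[\t ]*$',                    # trailing whitespace is tolerated
    re.IGNORECASE | re.MULTILINE,
)


def msgid_ok(headers: str) -> bool:
    """True when some header line looks like an acceptable Message-Id."""
    return RELAXED_OK.search(headers) is not None


# Made-up sample headers, for illustration only.
for header in [
    'Message-Id:<a@b>',
    'Message-Id: <a@b>  ',
    'Message-Id: <a@b@c>',
    'Message-Id: <>',
]:
    print(msgid_ok(header), repr(header))
```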

But anyhow, if you're serious about Message-Id checks, you should be
using Phil Guenther's regex. See below.

/dev/null

 Trashing email on the basis of bad Message-Id: is a bad idea.


I always try to dissuade people from throwing away stuff, even spam,
but those warnings tend to be left out when people pass on the
recipes. Especially with a recipe like the above, which can and will
trap some mail from legit but misconfigured servers, you should take
care to put the trapped mail in a safe place for inspection. (And
anybody serious about combatting spam should take care to complain
about every single piece of spam they receive.)

 For an excellent, human-readable understanding of the format of
 RFC 822 Message-Id:'s and translation into a regexp, search the
 procmail archives for Philip Guenther's mail from Mar 19 1998
 Subject: Re: bad message id's.


Definitely check this out if you haven't already. Here's the URL again:
  <http://www.xray.mpe.mpg.de/mailing-lists/procmail/1998-03/msg00268.html>

/* era */

-- 
Bot Bait: It shouldn't even matter whether  (`')  Just  (`')  http://www.iki
I am a resident of the State of Washington   \/ Married! \/   .fi/~era/