procmail

Re: Use scoring to determine header format?

2004-05-17 23:07:42
At 22:31 2004-05-17 -0400, fleet(_at_)teachout(_dot_)org wrote:
On Mon, 17 May 2004, Professional Software Engineering wrote:

> At 16:44 2004-05-17 -0400, fleet(_at_)teachout(_dot_)org wrote:
> >I'm seeing spam messages that appears to be from one individual (or
> >perhaps one software) that have a specific header format as:

[header format snipped]

> matching the above as-is certainly doesn't mandate using scoring to achieve it.

I'm not sure what you're saying here.  I tried, without success:

* Received:
* Received:
* Received:
* Message-id:
* Received:


One condition line, no scoring:

:0
* ^Received:.*^Received:.*^Received:.*^Message-id:.*^Received:


> * ^(From|Date|Subject|Reply-To):(.*$)+Received:

This works; but doesn't restrict the matches with respect to number (of
course).  But now I'm confused about the '+'.  Here it seems to be
concatenation and not "one or more."

The point was, one condition line, and it crosses header lines.
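
On the '+' question: in procmail's regexps a $ in the middle of a pattern matches a newline, so (.*$) consumes the rest of the current line plus its newline, and (.*$)+ repeats that one or more times. A minimal sketch of the difference, assuming the made-up folder names adjacent.mbx and later.mbx:

```procmail
# Fires only when Subject: is the header line immediately after From:
# ((.*$) eats the rest of the From: line and its newline, nothing more).
:0:
* ^From:(.*$)Subject:
adjacent.mbx

# Fires when Subject: starts any later header line
# ((.*$)+ eats the rest of the From: line plus zero or more whole lines).
:0:
* ^From:(.*$)+Subject:
later.mbx
```

Because the first recipe delivers, the second only ever sees messages where Subject: does not directly follow From:.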

> There's no RFC which declares that Received headers must appear before others.

And that answers my other question!  Thank you.

Well, that doesn't mean that using it as a spammy trait isn't useful - just as certain keywords are more spammy than others. However, traits which find an equal distribution amongst spam and legitimate traffic are a bit more difficult to justify.

:0
* -4^0
* 1^0 ^Received:(.*$)+Received:
* 1^0 ^Received:(.*$)+Received:
* 1^0 ^Received:(.*$)+Message-Id:
* 1^0 ^Message-Id:(.*$)+Received:
* 1^0 ^Received:

The first two regexp conditions will match the SAME two received headers. If you really want three in a row, why not just add a third Received in ONE condition? If you were to duplicate the condition line a third time, you'd still be matching on TWO received lines (and there's no requirement here that they be BEFORE the Message-ID, or consecutive).

The final condition will match on any old received header, and is just about GUARANTEED to match on every email that passes through your system (at least via SMTP - a local delivery directly from some app into your LDA won't insert, but that supposes that something is bypassing the MTA to do so).

FURTHER - the (.*$)+ expression will match *MULTIPLE* intermediate header lines. Thus, the following will match your complete recipe:

Received: blah
Message-Id: blah
From: yea_not_part_of_the_condition
Received: blah

Those two Received lines meet the first and second conditions; the first Received and the Message-Id meet the third condition; the Message-Id and the Received reached by skipping the From: line meet the fourth condition; and the FIRST Received line is going to match your final condition. That is five matches at 1 apiece against the initial -4, for a final score of 1, so the recipe fires.

If you want three receiveds, a message-id, and a fourth received, scoring isn't part of the picture - a single-line unified regexp is:

:0:
* ^Received:(.*$)Received:(.*$)Received:(.*$)+Message-Id:(.*$)+Received:
spew.mbx


In English:

Three received lines in IMMEDIATE SUCCESSION (no intermediate headers), then optionally other headers (the + following the third received expression), then the Message-Id:, followed by optional intermediate headers (again, the +), followed by another Received:

Lose the + expressions if you actually want the series to be consecutive headers without intermediate fluff.
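
With the + quantifiers removed, the single-condition recipe above would look roughly like this (a sketch; spew.mbx is the same folder name as before):

```procmail
# Three Received:, then Message-Id:, then Received:, on five consecutive
# header lines with nothing in between.
:0:
* ^Received:(.*$)Received:(.*$)Received:(.*$)Message-Id:(.*$)Received:
spew.mbx
```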

There's no scoring, as it really isn't applicable here, and it's only confusing the matter for you.

Increasingly, I find that unwanted email doesn't really carry a lot of extra Received headers - sure, some spammers still think it'll throw off the scent, but so many seem to be spamming directly from broadband accounts nowadays instead of spoofing through other servers (many of which get blocked by DNSBLs).

The problem is - How do I say in the last condition "Received: followed by
NOT Received.  I tried * 1^0 ^Received:(.*$)+[^Received:], which didn't

Uhm, why do you need to do this? Shouldn't Message-Id:(.*$)Received: match your last two header conditions just fine? If you really want something following the final Received header, you can add (.*$)(.*$) to the regexp I gave above, or you can use a character class inversion:

[^R][^e][^c][^e][^i][^v][^e][^d][^:]

Which I think is wholly unnecessary unless you really believe there's an issue where there will be ONLY one Received: after the Message-Id, but MULTIPLE such headers are kosher.
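
If that case does matter, one alternative to the character-class chain (a sketch, not one of the recipes from this thread) is procmail's leading ! to invert a condition. "Exactly one Received: right after the Message-Id:" can then be approximated with two conditions. This assumes the message carries a single Message-Id: header, and exactly-one.mbx is a made-up folder name:

```procmail
# First condition: a Received: line sits directly after Message-Id:.
# Second condition (inverted by !): that Received: is NOT directly
# followed by another Received:.
:0:
* ^Message-Id:(.*$)Received:
* ! ^Message-Id:(.*$)Received:(.*$)Received:
exactly-one.mbx
```

Unlike the character-class chain, this should not reject an unrelated following header such as Reply-To: merely because it begins with R.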

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail