procmail

Re: Two simple recipes that don't work

2004-02-29 11:50:29
On Sun, 29 Feb 2004, Robert Krueger wrote:

> :0HB:
> * ^Content-Type:.*text/html
> blue-spam
>
> Two questions.  Why did you put an asterisk before the "text/html"?

The important thing is that it's after the "." -- ".*" means match any
number of repetitions (including zero) of anything.  The "*" is what
means "any number of repetitions" of whatever came before.
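
If you want to poke at that outside of procmail, here is a rough sketch
in Python.  It is only an approximation: Python's re module is not
procmail's regexp engine, the IGNORECASE and MULTILINE flags are my
assumption for imitating procmail's default case-insensitive matching
with "^" anchored at each line, and the sample lines are made up.

```python
import re

# Approximation of the procmail condition "^Content-Type:.*text/html".
# IGNORECASE and MULTILINE are assumptions standing in for procmail's
# default case-insensitive matching and per-line "^" anchoring.
PATTERN = re.compile(r"^Content-Type:.*text/html", re.IGNORECASE | re.MULTILINE)

# Made-up sample lines, for illustration only.
samples = [
    "Content-Type:text/html",                      # ".*" matches zero characters
    "Content-Type: text/html; charset=us-ascii",   # ".*" matches a single space
    "Content-Type: multipart/mixed; x=text/html",  # ".*" matches a longer run
    "Content-Type: text/plain",                    # no "text/html" to find
]

def matches(line):
    """Return True if the pattern is found starting at the beginning of a line."""
    return PATTERN.search(line) is not None

for line in samples:
    print(matches(line), repr(line))
```

The first sample is the "including zero" case: nothing at all sits
between the colon and "text/html", and the pattern still matches.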

> Also, I was told there's some kind of procmail bug that doesn't like the
> "HB" right after the "0" ( :0HB: )

There's a bug with "H" specifically -- in some versions of procmail, once
the "H" flag has appeared it is never cleared again, so all subsequent
recipes act as though they also have "H".

> Instead, I was advised to use an alternate format like this: (I think)
> :0 :
> * HB ^Content-Type:.*text/html
> blue-spam
>
> Is that correct?

The idea is right but the syntax is wrong.

:0 :
* HB ?? ^Content-Type:.*text/html
blue-spam

Although I think that's a bit extreme as a condition all by itself, as
there are any number of ordinary email applications that might generate
HTML as a body part.  Or someone might write an ordinary sentence that
mentions "Content-Type: text/html" and happens to wrap such that the
phrase lands at the beginning of a line.  You could avoid misclassifying
the latter with a scoring recipe:

:0 :
* -1^0
*  2^0 H ?? ^Content-Type:.*text/html
*  1^0 H ?? ^Content-Type:.*multipart
*  1^0 B ?? ^Content-Type:.*text/html
blue-spam

This means:

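Start with a negative score (-1).  If text/html is in the header, add 2
to the score.  If multipart is in the header, add 1 to the score.  If
text/html is in the body, add 1 to the score.  Thus the only messages
that end up with a positive score, and so match, are those that either
have text/html in the header (-1 + 2 = 1) or have BOTH multipart in the
header AND text/html in the body (-1 + 1 + 1 = 1).  A stray
"Content-Type: text/html" line in the body of a non-multipart message
only brings the score up to 0, which is not a match.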
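
The arithmetic of the scoring recipe above can be checked with a few
lines of Python.  This is just a sketch of the bookkeeping, not of
procmail itself: the three booleans stand for "did this condition's
regexp match", and the function names are mine.

```python
from itertools import product

def score(header_html, header_multipart, body_html):
    """Add up the recipe's weights; with ^0 each condition counts at most once."""
    total = -1                 # * -1^0  (unconditional starting score)
    if header_html:
        total += 2             # *  2^0 H ?? ^Content-Type:.*text/html
    if header_multipart:
        total += 1             # *  1^0 H ?? ^Content-Type:.*multipart
    if body_html:
        total += 1             # *  1^0 B ?? ^Content-Type:.*text/html
    return total

def recipe_matches(header_html, header_multipart, body_html):
    """A scored recipe matches only when the final score is positive."""
    return score(header_html, header_multipart, body_html) > 0

# Print every combination of the three conditions with its score and outcome.
for combo in product([False, True], repeat=3):
    print(combo, score(*combo), recipe_matches(*combo))
```

The case to look at is (False, False, True): text/html appears only in
the body, the score ends at 0, and the message is not filed.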

The 2^0 could be replaced with a very large score (see "man procmailsc"
for the actual maximum score -- a common idiom is to write 9876543210^0)
to short-circuit the scoring at that point and thus avoid the body scan.
The "H ??" prefixes are not actually needed, as searching the header is
the default.  And if the message has no Content-Type header at all you
can skip the whole thing.  So an "optimized" version might be:

:0 :
* ^Content-Type:\/.*
*  9876543210^0   MATCH ?? text/html
* -9876543210^0 ! MATCH ?? multipart
*  9876543210^0       B ?? ^Content-Type: text/html
blue-spam
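
In case the control flow of that last recipe is hard to follow, here is
the same decision written out as a Python sketch.  Treat it as an
approximation under assumptions of mine: the function name is made up,
Python's re stands in for procmail's regexps, and unlike procmail it
does not join continued (folded) header lines before matching.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE  # assumed stand-in for procmail defaults

def optimized_match(header, body):
    """Mimic the order in which the optimized recipe reaches a verdict."""
    # * ^Content-Type:\/.*   (the captured rest of the line plays the role of MATCH)
    found = re.search(r"^Content-Type:(.*)", header, FLAGS)
    if found is None:
        return False           # no Content-Type header: recipe fails right away
    match_var = found.group(1)

    # *  9876543210^0 MATCH ?? text/html   (huge score: match, stop scoring)
    if re.search(r"text/html", match_var, re.IGNORECASE):
        return True

    # * -9876543210^0 ! MATCH ?? multipart   (huge negative score: no match, stop)
    if not re.search(r"multipart", match_var, re.IGNORECASE):
        return False

    # *  9876543210^0 B ?? ^Content-Type: text/html   (only now scan the body)
    return re.search(r"^Content-Type: text/html", body, FLAGS) is not None
```

Note that the body is only looked at in the last step, which is the
whole point of the huge scores: for most messages the verdict is reached
from the header alone.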

Clear as mud?


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail