Re: Separate incoming mail into 4 categories

At 22:43 2007-01-04 +0800, DR. Lee - NS3 wrote:

I had meant to critique (for the purposes of educating) some of the logic 
in your originally posted recipe:

:0
* ^To:.*<.+>
* $ ? grep ^To:.* |gawk -F '<' '//{print $2}' |gawk -F \> '//{print $1}' |grep -f to_list
{

[snip]

So, these conditions run, and let's say the address is successfully 
extracted but doesn't match the to_list: you'll fall through to the next 
recipe, which isn't meant for a bracketed To:, but will attempt to extract 
the header just the same.

        * ^From:.*<.+>
        * $ ? grep ^From:.* |gawk -F '<' '//{print $2}' |gawk -F \> '//{print $1}' |grep -f from_list


This sub-recipe condition launches off assuming that the From: header will 
in fact be formatted with brackets just like the To: was.  If it ISN'T, 
then the from_list won't be queried, and the message will drop through to 
MATCH_TO even if the (unbracketed) address would have been found in from_list.

* $ ? grep ^To:.* |gawk '//{print $2}' |grep -f to_list


Now, if the address WAS in brackets, but just wasn't matched in the 
to_list, we'll have fallen through to the second recipe group, which will 
process the To: header as if it didn't have encapsulating brackets -- even 
if in fact, it did.  Which means we suffer all the extraction again, but 
fail to get a token we should expect to find in the to_list file, so this 
condition predictably fails (when the address is bracketed).
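
A quick shell sketch of what that whitespace split hands to grep when the 
header carries a display name.  This is for illustration only: plain awk 
stands in for gawk, and the helper name plain_extract and the sample 
headers are invented for the example.

```shell
#!/bin/sh
# Hypothetical helper mirroring the second recipe group's extraction:
# split the To: line on whitespace and keep field 2.
plain_extract() {
    grep '^To:' | awk '{print $2}'
}

# Bare address: field 2 is the address itself.
printf '%s\n' 'To: user@example.com' | plain_extract

# Display name plus bracketed address: field 2 is only the first word of
# the name ("Some, with its leading quote), not the address.
printf '%s\n' 'To: "Some User" <user@example.com>' | plain_extract
```

In the second case the token passed along to grep -f to_list has nothing 
to do with the address that is actually in the header.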

        * ^From:.*
        * $ ? grep ^From:.* |gawk -F '<' '//{print $2}' |gawk -F \> '//{print $1}' |grep -f from_list
        ! $MATCH_BOTH


Curiously, this inspection of the From: again expects it to be bracketed, 
even though it is within an outer condition for an unbracketed (though not 
confirmed to be unbracketed) To:.  Uh, so you have NO support for the 
entirely legal address syntax shown on the From: header of the messages 
I've been sending to the Procmail list for the past (gaak!) 11+ years.
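
To make that concrete, a rough shell sketch of the bracket-splitting 
pipeline run against both address styles.  It is an illustration, not part 
of the original recipe: plain awk stands in for gawk, and the helper name 
bracket_extract and the sample addresses are made up.

```shell
#!/bin/sh
# Hypothetical helper mirroring the bracket-splitting pipeline from the
# quoted condition: take what follows '<', then what precedes '>'.
bracket_extract() {
    grep '^From:' | awk -F '<' '{print $2}' | awk -F '>' '{print $1}'
}

# Bracketed form: the address comes out clean (user@example.com).
printf '%s\n' 'From: "Some User" <user@example.com>' | bracket_extract

# Unbracketed (but legal) form: there is no '<', so field 2 is empty and
# an empty line is all that gets piped on to grep -f from_list.
printf '%s\n' 'From: user@example.com (Some User)' | bracket_extract
```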

Ok, so I haven't changed my posting style much in over a decade, and that 
certainly doesn't mean it is predominant - but it does remain LEGAL 
formatting.  Lest you think of me as some lone loon, the following 
significant procmail contributors have at some point used the same From: 
formatting:  DWT, TJL, and (drumroll ...) SRB.

If you review the recipe I posted, you'll see that I use an extraction 
which passes through formail, which helpfully strips the address of comment 
tokens and whatnot, reducing it to a simple address.  No brackets, no muck.

:0
* ^From:.*
* ? grep $MATCH -f from_list
! $MATCH_FROM


Now, if neither of the To: conditions matched, we go to extract and check 
the From: address by its lonesome - but, er, only in its unbracketed 
form.  If in fact it was bracketed, this won't match against the 
file.  And that's assuming you'd properly extracted a MATCH in the first 
condition, which you haven't: it is devoid of the \/ match construct (which 
would still grab the field complete with comments).  The grep operation 
therefore will be searching from_list for whatever MAY have been in $MATCH 
from some prior ruleset somewhere in your procmailrc.

:0
! $MATCH_NEITHER


I expect a lot of messages would have delivered here which were not 
intended to, based on the above issues.

In the worst-case scenario, your rulesets would invoke a 
grep|gawk|gawk|grep, fail on that, then hit the second recipe and do a 
grep|gawk|grep, fail on that, then fail on the final grep of the From: 
(which itself isn't good), only to fall through to MATCH_NEITHER.  OR, fail 
the first sequence, match the second, and then perform a 
grep|gawk|gawk|grep, and either fail or succeed with that (which would be 
the most CPU intensive: 6 greps, 5 gawks, and STILL not likely to properly 
match a fair number of messages).

That's a LOT of processes.

My offered approach involves a pipe to formail for the CLEANFROM extraction, 
and an echo|sed|tr pipe for the To: extraction and cleaning (and this 
pipeline is pretty lightweight, unlike even a single invocation of 
gawk).  Then two singular grep operations (no shell pipelines).  In my 
environment, CLEANFROM is executed for all messages anyway, because it's a 
useful extraction, used in various places (that's why it is in my 
sandbox).  Bottom line: my recipe will run about half as many processes, 
and will do so quite consistently (as it isn't a series of alternate forms 
to accommodate different input formats) - that is, whether an address is or 
isn't in your files, the processing power necessary to check will be about 
the same.  With your approach, if they are in it with the first form of 
formatting, they may match with _just_ 4 greps and 4 gawks (8 processes, 
not including shells).  If they're in it with the second formatting (or not 
at all), it'll be more processes (never mind the accuracy of the 
expressions).  If you get a lot of mail, all those cycles add up.
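
A rough shell sketch of the sed|tr cleaning idea followed by a single grep, 
for illustration only (it is not the recipe from the earlier post).  The 
helper names clean_addr and in_list, the sed expressions, the grep options 
and the sample addresses are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper: reduce one header line to a bare, lowercase address.
# Handles "Name <addr>", "addr (Name)" and a plain "addr".
clean_addr() {
    sed -e 's/^[^:]*: *//' \
        -e 's/.*<\([^>]*\)>.*/\1/' \
        -e 's/ *(.*)//' |
    tr '[:upper:]' '[:lower:]'
}

# Hypothetical helper: one grep, no pipeline.  -F treats the address as a
# fixed string (so '.' is literal), -x requires a whole-line match,
# -i ignores case, -q suppresses output.
in_list() {
    grep -qixF -- "$1" "$2"
}

list=$(mktemp)
printf '%s\n' 'user@example.com' 'other@example.org' > "$list"

for hdr in 'To: "Some User" <User@Example.COM>' \
           'To: User@Example.COM (Some User)' \
           'To: nobody@example.net'
do
    addr=$(printf '%s\n' "$hdr" | clean_addr)
    if in_list "$addr" "$list"; then
        echo "listed:     $addr"
    else
        echo "not listed: $addr"
    fi
done

rm -f "$list"
```

Whatever the input format, each header costs the same sed, tr and grep, 
which is the consistency argument made above.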

The issues outlined above don't apply to the solution I offered yesterday 
(though certainly in the process of testing it, you might find some other 
issues).  I offer the above criticisms so that you might review them and 
see some of the errors in the original implementation which prevented it 
from functioning as you had hoped.

---
  Sean B. Straw / Professional Software Engineering

  Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
  Please DO NOT carbon me on list replies.  I'll get my copy from the list.

