At 15:33 2002-03-08 +0300, Odhiambo G. Washington did say:
Break it down:
> BLACKLIST="(`perl -p0e '
from 'perl --help':
  -p              assume loop like -n but print line also, like sed
  -0[octal]       specify record separator (\0, if no argument)
  -e 'command'    one line of program (several -e's allowed, omit programfile)
> s/\012(.)/\|$1/g;
s/orig/new/flags
s = replacement operation
g = flag for "global" (i.e. repeat as necessary)
\012 is NEWLINE.
(.) says "followed by _something_" (versus the very last newline, which has
nothing after it). Having it in parens says "group this", which assigns it
a replacement token of $1.
replacement text is \| (literal |) and the original text ($1)
What that does is take every line in the file:
jack[newline]
jill[newline]
fred[newline]
and converts it:
jack|jill|fred
(note no | after fred)
> s/\015//g;
Octal 15 is CR, and the replacement text is nothing, so this is simply
weeding out carriage returns in case you have a DOS type file (which will
also have the newlines).
> s/\|{2,}/\|/g;
This regexp says if we have two or more '|' in a row (which might be caused
by a blank line), just replace it with a single '|'.
> s/\./\\\./g' < $PMDIR/black.lst`)"
Lastly, take literal '.' and replace them with a literal '\.', because
we're GENERATING a regexp.
The redirection of the black.lst file here feeds perl's standard input;
that part isn't IN the perl code itself (which ends at that single tick,
not the backtick). The whole expression is enclosed in parens, so when
we're done, the resulting expression will be:
(jack|jill|fred)
> # Check banlist
> :0:
> * ! (^From|^To):.*($lists)
> * ? formail -xFrom: | fgrep -if $BLACKLIST
> SPAM
I think it'd make a LOT more sense to ditch fgrep and hand over the FROM
argument to a perl script (after all, you're _ALREADY_ invoking perl). At
a bare minimum, if the above code is the ONLY place you check the blacklist
(and let's assume you weren't having the immediate line length problem), you
should restructure it:
# Check banlist
:0
* ! (^From|^To):.*($lists)
{
        # perform perl evaluation as above.
        :0:
        * ? formail -xFrom: | fgrep -if $BLACKLIST
        SPAM
}
This would then execute the perl code *ONLY* when it has been determined
that the $BLACKLIST is actually needed. Otherwise, you're probably running
that on nearly every message - including the ones that match the $lists
expression.
A technique used by some members (though not myself) is to actually emit a
generated regexp INTO a procmailrc (with the appropriate trimmings) which
you then INCLUDERC. In doing this, you'd need to properly lock the file
operation, and realize that just to run the test, you must create a file,
which is that much more overhead.
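As a rough sketch of that approach (again, not something I run myself),
the generation step might look like the following. Everything named here
is an assumption for illustration: the helper emit_blacklist_rc, the file
names, and the shape of the emitted recipe. It also swaps the perl
one-liner for a tr/sed/paste pipeline, and it stands in for real locking
by writing to a temporary name and renaming the file into place.

```shell
#!/bin/sh
# Sketch only: write a generated recipe into an rc file for INCLUDERC.
# emit_blacklist_rc and the file names used with it are hypothetical.
emit_blacklist_rc() {
    # $1 = list file (one entry per line), $2 = rc file to (re)generate
    # Strip CRs, drop blank lines, escape dots, then join lines with '|'.
    regex=$(tr -d '\015' < "$1" | sed -e '/^$/d' -e 's/\./\\./g' |
            paste -s -d '|' -)
    # An empty list would yield "()", which matches everything: refuse.
    [ -n "$regex" ] || return 1
    tmp="$2.$$"
    # Build the file under a temporary name, then rename it into place, so
    # procmail never reads a half-written rc file.
    printf ':0:\n* ^From:.*(%s)\nSPAM\n' "$regex" > "$tmp" && mv "$tmp" "$2"
}
```

You would run something like emit_blacklist_rc $PMDIR/black.lst
$PMDIR/blacklist.rc whenever the list changes, and pull the result in with
INCLUDERC=$PMDIR/blacklist.rc (blacklist.rc being a made-up name).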
"The file name is too long" ...
The -f argument to grep says the following argument is a FILENAME: grep
reads that file and uses each line in it as a pattern to match against the
text passed in (versus grepping the passed text for a pattern given on the
commandline). That seems patently WRONG for what you're doing (as is using
fgrep, which takes fixed strings, not regexps). I would think BEFORE you
started getting the above error, that you should have been seeing
something like:
fgrep: (jack|jill|fred): No such file or directory
In fact, I wondered how you could POSSIBLY not be seeing this, so I set up
a test and sure enough, that's EXACTLY what was logged to the procmail log.
Can you actually say that this EVER worked? You have confirmed results?
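For anyone who wants to reproduce the difference from a shell prompt, here
is a small sketch. The helper name check_from and the addresses are made
up, and printf stands in for the formail call; grep -F is just the standard
spelling of fgrep.

```shell
#!/bin/sh
# Sketch only: what -f does with a pattern string versus a real pattern file.
# check_from is a hypothetical helper; the addresses are made up.
check_from() {
    # $1 is handed to -f exactly the way the recipe hands over $BLACKLIST.
    printf 'jack@example.com\n' | grep -F -i -f "$1" > /dev/null 2>&1
}

# A generated pattern string: grep looks for a FILE of that name and fails
# with an error status rather than a plain "no match".
check_from '(jack|jill|fred)' || echo "pattern string: exit $?"

# The list file itself, one fixed string per line: this is what -f expects.
list=$(mktemp)
printf 'jack@example.com\njill@example.com\n' > "$list"
check_from "$list" && echo "list file: match"
rm -f "$list"
```

The first call fails because no file by that name exists; the second
matches because -f is given an actual file of fixed strings. Note that a
DOS-format list would carry a CR at the end of each pattern, so it would
need cleaning before being used this way.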
> procmail: Non-zero exitcode (2) from " formail -xFrom: | fgrep -if $BLACKLIST"
> when this recipe is run. The $BLACKLIST has grown to almost 1000 lines !!!
Which are all expanded onto the commandline - so take the SIZE of the
blacklist file and figure you have a commandline which probably slightly
exceeds this, because of the added tokens (the stuff the perl script is
doing). That's a LONG commandline, and -f is being asked to open all of
it as a file name.
> I am hoping that there is an efficient way to check the banlist, even by
> using some db????
I use a program I wrote called megagrep. But it doesn't do regular
expressions - it's a purpose-written word list handler (for potentially
VERY large wordlists, which grep is VERY inefficient with). With it, I am
able to quickly grep for words within the headers - such as hostnames in
the received lines, or addresses in the From: or recipient lists. That's a
C++ program, so it isn't nearly as slow as an interpreted program (which ALSO
has the overhead of the interpreter itself), and because it uses balanced
AVL trees to hold the data, it is efficient with memory usage and very fast
when searching for something. My spam.dat file (containing domains I
refuse email from) is about 3 MB, and this just zips through that, where
grep would just hog memory (LOTS of it), and take forever to process the
message (headers).
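The following is NOT megagrep, just a sketch of the same general idea
(load the word list once, then do exact lookups of words pulled from the
headers) for anyone who wants to experiment. It uses an awk associative
array in place of an AVL tree; the helper name match_words and the sample
domains are made up.

```shell
#!/bin/sh
# Sketch only, NOT megagrep: exact, case-insensitive word-list lookups.
# match_words is a hypothetical helper name.
match_words() {
    # $1    = word-list file (one entry per line, assumed non-empty)
    # stdin = candidate words, one per line; matching candidates are printed
    awk '
        NR == FNR {                      # first file: load the list once
            sub(/\r$/, "")               # tolerate DOS line endings
            if ($0 != "") list[tolower($0)] = 1
            next
        }
        { w = tolower($0); if (w in list) print }
    ' "$1" -
}

# Made-up data: two refused domains, three candidates pulled from headers
list=$(mktemp)
printf 'spam.example\nbulk.example\n' > "$list"
printf 'good.example\nBULK.example\nother.example\n' | match_words "$list"
rm -f "$list"
```

Only BULK.example is printed here. Unlike a generated regexp, nothing from
the list ever lands on the commandline, so the size of the list is not an
issue in that respect.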
I'd love to help you further, but I really must get some of my OWN work
done here. Hopefully the above evaluation will at least steer you onto the
right track.
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.