procmail

Re: Spam blacklist (in)efficiency?

2002-03-08 11:27:13
At 15:33 2002-03-08 +0300, Odhiambo G. Washington did say:

Break it down:

BLACKLIST="(`perl -p0e '

from 'perl --help':

-p              assume loop like -n but print line also, like sed
-0[octal]       specify record separator (\0, if no argument)
-e 'command'    one line of program (several -e's allowed, omit programfile)

                        s/\012(.)/\|$1/g;

s/orig/new/flags

s = replacement operation
g = flag for "global" (i.e. repeat as necessary)

\012 is NEWLINE.
(.) says "followed by _something_" (versus the very last newline, which has nothing after it). Having it in parens says "group this", which assigns it a replacement token of $1.

The replacement text is \| (a literal |) followed by the captured character ($1).

What that does is take every line in the file:

jack[newline]
jill[newline]
fred[newline]

and converts it:

jack|jill|fred

(note no | after fred)

                        s/\015//g;

Octal 15 is CR, and the replacement text is nothing, so this is simply weeding out carriage returns in case you have a DOS type file (which will also have the newlines).

                        s/\|{2,}/\|/g;

This regexp says if we have two or more '|' in a row (which might be caused by a blank line), just replace it with a single '|'.

                        s/\./\\\./g' < $PMDIR/black.lst`)"

Lastly, take literal '.' and replace them with a literal '\.', because we're GENERATING a regexp.

The redirection of the black.lst file here supplies the file as input TO perl; that part isn't IN the code itself (which ends at that single tick, not the backtick). The whole expression is encapsulated in parens, so when we're done, the resulting expression will be:

(jack|jill|fred)

# Check banlist
:0:
* ! (^From|^To):.*($lists)
* ? formail -xFrom: | fgrep -if $BLACKLIST
SPAM

I think it'd make a LOT more sense to ditch fgrep and hand over the FROM argument to a perl script (after all, you're _ALREADY_ invoking perl). At a bare minimum, if the above code is the ONLY place you check the blacklist (and let's assume you weren't having the immediate line length problem), you should restructure it:

# Check banlist
:0
* ! (^From|^To):.*($lists)
{
        # perform perl evaluation as above.

        :0:
        * ? formail -xFrom: | fgrep -if $BLACKLIST
        SPAM
}

This would then execute the perl code *ONLY* when it has been determined that the $BLACKLIST is actually needed. Otherwise, you're probably running that on nearly every message - including the ones that match the $lists expression.

A technique used by some members (though not myself) is to actually emit a generated regexp INTO a procmailrc (with the appropriate trimmings) which you then INCLUDERC. In doing this, you'd need to properly lock the file operation, and realize that just to run the test, you must create a file, which is that much more overhead.

"The file name is too long" ...

The -f argument tells grep that the next argument is the name of a file of patterns, one per line, each of which it matches against the input text (versus treating the argument itself as the pattern). That seems patently WRONG for what you're doing, as does using fgrep, which matches fixed strings and so won't treat the '|' in a generated regexp as alternation. I would think that BEFORE you started getting the above error, you should have been seeing something like:

fgrep: (jack|jill|fred): No such file or directory

In fact, I wondered how you could POSSIBLY not be seeing this, so I set up a test and sure enough, that's EXACTLY what was logged to the procmail log.
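
A rough reproduction of that kind of test, outside procmail. The header line and list contents are invented for the example, and the exit codes assume a grep that reports errors with a status greater than 1 (GNU grep uses 2):

```shell
list=$(mktemp)
printf 'jack\njill\nfred\n' > "$list"
header='From: Jill <jill@example.com>'   # hypothetical header

# 1. Generated regexp handed to -f: grep tries to open a FILE of that name.
regexp_as_file=0
printf '%s\n' "$header" | grep -iF -f '(jack|jill|fred)' >/dev/null 2>&1 \
    || regexp_as_file=$?

# 2. The list file itself handed to -f: one fixed-string pattern per line.
list_as_file=0
printf '%s\n' "$header" | grep -iF -f "$list" >/dev/null \
    || list_as_file=$?

# 3. Generated regexp handed over as a pattern, with -E so | means "or".
regexp_as_pattern=0
printf '%s\n' "$header" | grep -iE -e '(jack|jill|fred)' >/dev/null \
    || regexp_as_pattern=$?

printf 'as file: %s, list file: %s, as pattern: %s\n' \
    "$regexp_as_file" "$list_as_file" "$regexp_as_pattern"
rm -f "$list"
```

Only the first form fails, and it fails with an error status rather than a plain "no match", which is consistent with the non-zero exit code procmail logs.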

Can you actually say that this EVER worked?  You have confirmed results?

procmail: Non-zero exitcode (2) from " formail -xFrom: | fgrep -if $BLACKLIST"

when this recipe is run. The $BLACKLIST has grown to almost 1000 lines !!!

Which are all expanded onto the commandline - so take the SIZE of the blacklist file and figure you have a commandline which probably slightly exceeds this, because of the added tokens (the stuff the perl script is doing). That's a LONG commandline.

I am hoping that there is an efficient way to check the banlist, even by using some db????

I use a program I wrote called megagrep. But it doesn't do regular expressions - it's a purpose-written word list handler (for potentially VERY large wordlists, which grep is VERY inefficient with). With it, I am able to quickly grep for words within the headers - such as hostnames in the received lines, or addresses in the From: or recipient lists. That's a C++ program, so it isn't nearly as slow as an interpreted program (AND the program overhead of the interpreter itself), and because it uses balanced AVL trees to hold the data, it is efficient with memory usage and very fast when searching for something. My spam.dat file (containing domains I refuse email from) is about 3 MB, and this just zips through that, where grep would just hog memory (LOTS of it), and take forever to process the message (headers).
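
This is not megagrep, and nothing here reflects how it is implemented. As a stand-in for the general idea of a word-list lookup rather than a regexp match, here is a sketch that tests a sender's domain for an exact, whole-line, fixed-string hit in a list file. The function name, file contents, and addresses are all hypothetical, and it is a plain linear scan with none of the tree-based speedups described above:

```shell
# Hypothetical list of refused domains, one per line.
domains=$(mktemp)
printf 'spam.example\njunk.example\n' > "$domains"

# is_banned ADDRESS: succeed if the address's domain is a line in the list.
is_banned() {
    domain=${1##*@}                            # strip everything up to the last @
    grep -Fxqi -e "$domain" "$domains"         # fixed string, whole line, quiet
}

if is_banned 'jill@spam.example'; then verdict1=banned; else verdict1=ok; fi
if is_banned 'fred@good.example'; then verdict2=banned; else verdict2=ok; fi

printf '%s %s\n' "$verdict1" "$verdict2"
rm -f "$domains"
```

Because the list is read as data rather than expanded into a pattern on the command line, its size never affects the length of the command itself.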


I'd love to help you further, but I really must get some of my OWN work done here. Hopefully the above evaluation will at least steer you onto the right track.

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
