Re: Spam blacklist (in)efficiency?

At 15:33 2002-03-08 +0300, Odhiambo G. Washington did say:

Break it down:

BLACKLIST="(`perl -p0e '


from 'perl --help':

-p              assume loop like -n but print line also, like sed
-0[octal]       specify record separator (\0, if no argument)
-e 'command'    one line of program (several -e's allowed, omit programfile)

                        s/\012(.)/\|$1/g;


s/orig/new/flags

s = replacement operation
g = flag for "global" (i.e. repeat as necessary)

\012 is NEWLINE.

(.) says "followed by _something_" (versus the very last newline, which has nothing after it). Having it in parens says "group this", which assigns it a replacement token of $1.


replacement text is \| (literal |) and the original text ($1)

What that does is take every line in the file:

jack[newline]
jill[newline]
fred[newline]

and convert it to:

jack|jill|fred

(note no | after fred)

                        s/\015//g;

Octal 15 is CR, and the replacement text is nothing, so this is simply weeding out carriage returns in case you have a DOS type file (which will also have the newlines).

                        s/\|{2,}/\|/g;

This regexp says if we have two or more '|' in a row (which might be caused by a blank line), just replace it with a single '|'.

                        s/\./\\\./g' < $PMDIR/black.lst`)"

Lastly, take literal '.' and replace them with a literal '\.', because we're GENERATING a regexp.

The redirection of the black.lst file here is input TO perl; that part isn't IN the code itself (which ends at that single tick, not the backtick). The whole expression is encapsulated in parens, so when we're done, the resulting expression will be:


(jack|jill|fred)

# Check banlist
:0:
* ! (^From|^To):.*($lists)
* ? formail -xFrom: | fgrep -if $BLACKLIST
SPAM

I think it'd make a LOT more sense to ditch fgrep and hand over the FROM argument to a perl script (after all, you're _ALREADY_ invoking perl). At a bare minimum, if the above code is the ONLY place you check the blacklist (and let's assume you weren't having the immediate line length problem), you should restructure it:


# Check banlist
:0
* ! (^From|^To):.*($lists)
{
        # perform perl evaluation as above.

        :0:
        * ? formail -xFrom: | fgrep -if $BLACKLIST
        SPAM
}

This would then execute the perl code *ONLY* when it has been determined that the $BLACKLIST is actually needed. Otherwise, you're probably running that on nearly every message - including the ones that match the $lists expression.

A technique used by some members (though not myself), is to actually emit a generated regexp INTO a procmailrc (with the appropriate trimmings) which you then INCLUDERC. In doing this, you'd need to properly lock the file operation, and realize that just to run the test, you must create a file, which is that much more overhead.

"The file name is too long" ...

The -f argument to grep says the following argument should be a filename; grep will run through each line of that file and grep for it in the passed text (versus grepping for the passed text in a file). That seems patently WRONG for what you're doing (as well as using fgrep to do it). I would think BEFORE you started getting the above error, that you should have been seeing something like:


fgrep: (jack|jill|fred): No such file or directory

In fact, I wondered how you could POSSIBLY not be seeing this, so I set up a test and sure enough, that's EXACTLY what was logged to the procmail log.
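A quick way to reproduce that outside of procmail is sketched below. The address and list contents are made-up sample data, and grep -F stands in for fgrep. The second call, which hands -f the list file directly, is shown only to illustrate what -f actually expects; it is not what the original recipe does.

```shell
list=$(mktemp)
printf 'jack\njill\nfred\n' > "$list"
from='jill@example.com'

# The generated expression lands where grep expects a file name, so grep
# cannot open it and reports an error (exit status greater than 1).
expr_status=0
printf '%s\n' "$from" | grep -F -i -f '(jack|jill|fred)' >/dev/null 2>&1 \
    || expr_status=$?

# With the list file as the -f argument, each of its lines is a fixed
# string that grep looks for in the text on stdin.
file_status=0
printf '%s\n' "$from" | grep -F -i -f "$list" >/dev/null \
    || file_status=$?

echo "expression as file name: exit $expr_status"
echo "list file as file name:  exit $file_status"
rm -f "$list"
```

A nonzero status from the first form is what procmail sees for the condition, regardless of who sent the message.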


Can you actually say that this EVER worked?  You have confirmed results?

procmail: Non-zero exitcode (2) from " formail -xFrom: | fgrep -if $BLACKLIST"

when this recipe is run. The $BLACKLIST has grown to almost 1000 lines !!!

Which are all expanded onto the commandline - so take the SIZE of the blacklist file and figure you have a commandline which probably slightly exceeds this, because of the added tokens (the stuff the perl script is doing). That's a LONG commandline.

I am hoping that there is an efficient way to check the banlist, even by
using some db????

I use a program I wrote called megagrep. But it doesn't do regular expressions - it's a purpose-written word list handler (for potentially VERY large wordlists, which grep is VERY inefficient with). With it, I am able to quickly grep for words within the headers - such as hostnames in the received lines, or addresses in the From: or recipient lists. That's a C++ program, so it isn't nearly as slow as an interpreted program (AND the program overhead of the interpreter itself), and because it uses balanced AVL trees to hold the data, it is efficient with memory usage and very fast when searching for something. My spam.dat file (containing domains I refuse email from) is about 3 MB, and this just zips through that, where grep would just hog memory (LOTS of it), and take forever to process the message (headers).
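To give a feel for the word list approach (as opposed to regexps), here is a rough stand-in using awk. To be clear about what is assumed: this is NOT megagrep and says nothing about its interface; the name wordlist_hit is hypothetical, an awk array (a hash table) is swapped in for the AVL tree, and the tokenizing rule (split on anything that isn't a letter, digit, '.' or '-') is my own simplification. The idea it shows is the same: load the list once, then do an exact lookup per word found in the headers.

```shell
# Hypothetical helper: exit 0 if any word in the text on stdin is an
# exact, case-insensitive match for an entry in the list file "$1".
wordlist_hit() {
    awk -v list="$1" '
        BEGIN {
            # Load the word list once, one entry per line.
            while ((getline entry < list) > 0) {
                sub(/\r$/, "", entry)
                if (entry != "") banned[tolower(entry)] = 1
            }
            close(list)
        }
        {
            # Break each header line into words and look each one up.
            n = split(tolower($0), tok, /[^a-z0-9.-]+/)
            for (i = 1; i <= n; i++) {
                if (tok[i] in banned) { found = 1; exit }
            }
        }
        END { exit(found ? 0 : 1) }
    '
}

list=$(mktemp)
printf 'spam.example\nbad.test\n' > "$list"
if printf 'From: someone@spam.example\n' | wordlist_hit "$list"; then
    echo "refused domain"
fi
rm -f "$list"
```

Because every lookup is for a whole word, dots in the list need no escaping and there is no generated expression to grow with the list.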

I'd love to help you further, but I really must get some of my OWN work done here. Hopefully the above evaluation will at least steer you onto the right track.


---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.
