At 11:41 AM 8/17/97 -0500, Philip Guenther wrote:
:0 hir
* ?egrep -isFf $BLOCKDOM
/dev/null
Here's the variant I'm using:
($FORMAIL is defined to the path to the formail executable, $FGREP to the
fgrep executable, and $SPAMLIST to the file, with spam domains on
individual lines):
:0
* ? $FORMAIL -ISubject: | $FGREP -i -f $SPAMLIST
/dev/null
(With the recent addition of the more or less complete cyberpromo domain
list, my spamlist (domains) alone is at 843 entries. Among other lists
matched in a similar fashion, I also have a twitlist, which is just
addresses/name components to be matched against address-type headers.)
This matches everything occurring in the headers except for the subject
(that is, when looking for a match, the contents of the subject aren't
considered - this keeps us from matching on subjects that might contain
references to some spammer domain - such as occurs when discussing spam),
and doesn't give a rat's arse about the CaSe of the strings.
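To make the effect concrete, here is a rough sketch that can be run outside
of procmail. It is not the recipe itself: "grep -v" is only a crude stand-in
for "formail -ISubject:" (it ignores folded continuation lines, which formail
handles), and the one-entry list and the sample headers are made up for
illustration.

```shell
#!/bin/sh
# Hypothetical one-entry domain list; the real $SPAMLIST is not shown here.
SPAMLIST=$(mktemp)
printf '%s\n' 'cyberpromo.com' > "$SPAMLIST"

# Reads a header on stdin and succeeds if any non-Subject line contains
# a listed domain, ignoring case.
header_hits_list() {
    # Crude stand-in for "formail -ISubject:": drop lines starting with
    # "Subject:". Folded Subject continuation lines are NOT removed.
    grep -v -i '^Subject:' | grep -F -i -q -f "$SPAMLIST"
}
```

Fed a header whose only mention of cyberpromo.com is in the Subject, the
function fails (no delivery to /dev/null); fed one with the domain in From:,
in any case, it succeeds.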
Performance-wise (minus my additional overhead of formail), is there a big
difference between the two invocations of (f)grep? I know that I'm taking
a big performance hit by grepping out all the basic spam domains (OTOH,
look at all the extra disk space! :) ).
My spam domain list is not currently composed of regexp'd items. I'm
considering changing it so that all items are matched only when preceded
by a word break or period (" <.(@" mostly), since not doing so could be a
problem with some domains which end up being a shorter form of another
domain (mentioned here in this group have been usa.net vs netusa.net - if I
filter for the first one, I'll whack the second).
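A sketch of what that change might look like, assuming the list stays one
plain domain per line and is converted to patterns mechanically. The file
names, the sample entries, and the helper functions are made up for
illustration, and the pattern list would be matched with "grep -E" rather
than fgrep:

```shell
#!/bin/sh
# Hypothetical two-entry domain list, one plain domain per line.
SPAMLIST=$(mktemp)
printf '%s\n' 'usa.net' 'cyberpromo.com' > "$SPAMLIST"

# Current behaviour: fixed-string match, so any substring hits.
matches_plain() {
    printf '%s\n' "$1" | grep -F -i -q -f "$SPAMLIST"
}

# Proposed behaviour: escape the dots in each domain, then require either
# start of line or one of the characters space < . ( @ just before it.
ANCHORED=$(mktemp)
sed -e 's/\./\\./g' -e 's/^/(^|[ <.(@])/' "$SPAMLIST" > "$ANCHORED"

matches_anchored() {
    printf '%s\n' "$1" | grep -E -i -q -f "$ANCHORED"
}
```

With these, matches_plain 'From: bob@netusa.net' succeeds (the false hit),
while matches_anchored fails on that line and still succeeds on
'From: bob@usa.net' and 'Received: from mail.usa.net'. Note this only
anchors the front of the domain; nothing here guards against a longer
domain that merely starts with a listed one.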
---
Please DO NOT carbon me on list replies. I'll get my copy from the list.
Sean B. Straw / Professional Software Engineering
Post Box 2395 / San Rafael, CA 94912-2395