Twit Filtering (was RE: is this grep is correct?)

At 06:57 2000-09-28 -0400, Colin J. Raven wrote:

> You have something really interesting here Sean.
> A "twitlist" is easily maintainable, and this is an elegant approach IMHO.

Heh, to make it easier, I add users to the twitlist by emailing myself at a plussed address - the subject is extracted, and the address (or whatever token, really) from there is appended to the twitlist file. Removal requires shelling in and editing the file (only because I chose not to write a script to automate that, which would be easy enough) - it gives me the incentive to ask "is this person really repentant yet?" - if it isn't even worth the time to shell in and edit the file, I guess not.
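
The removal script mentioned above as easy to write might look something like the following. This is only a sketch under assumptions: the list file holds one entry per line, the function name twit_remove is made up for illustration, and no lock is taken, so an append by procmail during the rewrite could be lost.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): remove one entry
# from a list file such as twits.dat.
# Usage: twit_remove /path/to/twits.dat someone@example.com
twit_remove() {
    list=$1
    entry=$2
    # Refuse an empty entry rather than rewrite the file for nothing.
    [ -n "$entry" ] || return 2
    tmp=$(mktemp "${list}.XXXXXX") || return 1
    # -v invert, -i ignore case, -F fixed string, -x whole line only.
    # grep exits 1 when no lines are left to print; that is fine here,
    # it just means the list is now empty.
    status=0
    grep -v -i -F -x -e "$entry" "$list" > "$tmp" || status=$?
    if [ "$status" -gt 1 ]; then
        rm -f "$tmp"
        return 1
    fi
    mv "$tmp" "$list"
}
```

Matching whole lines (-x) means removing "a@example.com" leaves "aa@example.com" alone; drop -x if the list holds partial tokens that should be removed by substring.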

> 1. Do you have something like: INCLUDERC=$PMDIR/twitrc?

Something like that. Actually, the twits script is included from a notwit.rc file. Same sort of thing for spam.rc (included from nospam.rc) -- the idea is that invariably, you're going to know someone who trips some rule, and you can set them up in a whitelist. For instance, one of my spam rules fires when FROM=TO. But I've a friend or two who uses that for self-made BCC: distribution lists for things (never mind that if they understood mail, and had an ISP worth a damn, they could use a plussed address for the TO:) -- I can add their address to the nospam.dat file, and then they don't trip that rule (or any other spam rules for that matter).



Here's a snippage from my boxes.rc (which itself is included into .procmailrc):


# (snip - a few specific administrative INCLUDERCs precede these)

# Include rules for items to filter BEFORE spam
# (before twits as well - this happens to contain but ONE rule)
INCLUDERC=$PMDIR/prespam.rc

# Filter for TWITS (not spam, but individuals we don't take mail from)
# This is a select group.  Takes place BEFORE spam filtering as well as
# some other groups, because twits can exist even in groups normally free
# of spam.
INCLUDERC=$PMDIR/spam/notwit.rc

# Filter items which are clean of spam (outbound/moderated mailing lists,
# digests, and lists known to be subscriber only).  When a list starts
# getting spam, it gets moved from mailclean.rc to a list below the
# spam filters.
#---------------------------------------------------------------------------

# Include rules for spam-free (clean) mailing lists
INCLUDERC=$PMDIR/mailclean.rc

#---------------------------------------------------------------------------

# Include rules for Spam - wrapped in an exception filter.
INCLUDERC=$PMDIR/spam/nospam.rc

# (snip - bulk of INCLUDERCs follow)

In the notwit.rc I have rules for submitting new addresses to both the twits and notwits databases. The $SUBJECT which is used was extracted in the .procmailrc using a match (as are a number of other often-used variables). While I take the processing hit to extract these (subject/to/from/sender) on EVERY incoming message, I also only take the hit ONCE, and don't have multiple variations of extraction scattered all about:


        :0
        * ^Subject:[    ]*\/[^  ].*
        {
                SUBJECT=$MATCH
        }

# (notwit.rc begin)

# ==========================================================================

# Define necessary variables.

# Define the directory all this spam filtering is in...
SPAMDIR=$PMDIR/spam

# The top two recipes allow me to mail an ADDITION to either list (I don't
# have removal code here, as I didn't need to develop it - once a spammer,
# always a spammer -- if someone deserves to be removed, then it'll be worth
# the effort of logging in and manually editing the twit file to remove their
# address - if not, then I don't apparently really want to hear from them).

# ==========================================================================

NOTWITLIST=$SPAMDIR/notwit.dat # non-twit database

# Define the address that twit exception submission should go to.  I have
# aliases for my accounts (either virtual host, or manual aliases in the
# system aliases file), but if you have plussed address support you can use
# that.
# if plussed, this definition needs DOUBLE escaping.
NOTWITSUB=userid\\+NOTWITplussedaddr(_at_)host\\(_dot_)domain\\(_dot_)tld

# Is it to my submission address?
# Subject field is address (hey, I'm cheap)
:0
* $ ^TO.*$NOTWITSUB
{
        LOG="NOTE: NoTwitSubmit: added $SUBJECT
"

        :0:
        |echo $SUBJECT >> $NOTWITLIST
}

# ==========================================================================

TWITLIST=$SPAMDIR/twits.dat # twit database

# Define the address that twit submission should go to.  I have aliases for
# my accounts (either virtual host, or manual aliases in the system aliases
# file), but if you have plussed address support you can use that.

# if plussed, this definition needs DOUBLE escaping.
TWITSUB=userid\\+TWITplussedaddr(_at_)host\\(_dot_)domain\\(_dot_)tld

# Is it to my submission address?
# Subject field is address (hey, I'm cheap)
:0
* $ ^TO.*$TWITSUB
{
        LOG="SPAM: TwitSubmit: added $SUBJECT"

        :0:
        |echo $SUBJECT >> $TWITLIST
}

# ==========================================================================

# If there is a match on any string in the notwitlist anywhere within the
# headers, excepting the subject line (typically, we would expect to find the
# match in one of the from, to, cc, messageid, or received lines), then SKIP
# twit filtering.
# Just like nospam.rc, this filtering will also catch the X-PSE-BYPASS header,
# though it is matched as ordinary header text, not as a header per se.  Allows us to reprocess
# messages which were originally caught by the spam filters, and which really
# should remain filtered, but we need to import one or two into the mail stream
# (say for ease of forwarding, or getting an attachment).

:0h
FAILKEY=| ($FORMAIL -ISubject: | $MEGAGREP -i -f $NOTWITLIST)

# If failkey is blank, we didn't match anything in the notwit list
:0
* $FAILKEY ?? ^^^^
{
        LOCKFILE=$TEMP/twitsrc$LOCKEXT
        INCLUDERC=$PMDIR/spam/twits.rc
        LOCKFILE
}

# ==========================================================================

# (notwit.rc end)
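
The two submission recipes in notwit.rc append $SUBJECT to the list file as-is. One thing worth guarding against with any grep -f list is a blank line: an empty pattern matches every input line, so a submission with an empty subject would make every message match. A minimal sketch of a guarded append, assuming one entry per line (list_add is a hypothetical name, not part of the original rc files):

```shell
#!/bin/sh
# Hypothetical helper: append a token to a list file, skipping
# blank tokens and entries that are already present.
# Usage: list_add /path/to/twits.dat "someone@example.com"
list_add() {
    list=$1
    # Trim leading and trailing whitespace from the token.
    token=$(printf '%s' "$2" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
    # Never write an empty line: in a grep -f pattern file it
    # would match everything.
    [ -n "$token" ] || return 1
    # Already listed (whole line, case-insensitive, literal)? Done.
    if [ -f "$list" ] && grep -q -i -F -x -e "$token" "$list"; then
        return 0
    fi
    printf '%s\n' "$token" >> "$list"
}
```

A recipe could pipe to a script wrapping this instead of the bare echo; whether that is worth the extra process is a judgment call.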

There is a virtually identical (except for "twit" vs "spam" in various variables) copy of this rcfile for nospam.rc.


The twits.rc (which is a *LOT* less complex than the spam.rc), is:

# (twits.rc begin)

# (Revision history omitted)

# This file filters out messages coming from twits (by address).

# The sendmail access.txt database is my current favoured method of dealing
# with the true twit.  There, I don't incur the processing overhead of
# even an attempted local delivery and this filtering.  The sender also gets
# a bounce AT THE TIME OF THE SMTP TRANSACTION.

# Because of the differing matching performed between the twit and spam
# filtering, there are two distinct databases used.  See spam.rc for
# spam handling.

# ==========================================================================

# Define necessary variables.

# Define the directory all this spam filtering is in...
SPAMDIR=$PMDIR/spam

# Define the version of this filter, so we can emit a message to the log.
# when I tweak these rules, I tweak the version.  Simple.
TWITVER="
INFO: TwitFilter v02.00.00  PSE  2000.03.16 06:24:00
"

# ==========================================================================

# I realize this here is a spam filter (and is still present in the spam
# filters as well, in case twits were skipped for some reason), but this is
# singularly so important that we shouldn't skip it.

# From: header blank or not even present!
# Anybody mailing and not identifying a from, MUST be spamming.
:0
* ! ^From:[     ].+
{
        LOG="SPAM: No From:$TWITVER"

        :0:
        |gzip -9fc>>$MAILDIR/twits.gz
}

# ==========================================================================

NEUTLIST=$SPAMDIR/neutral.dat # neutral twit database

# If a SPECIFIC ADDRESS from the neutrallist appears anywhere within the
# headers, minus subject, and addressees, toss it.  Most people would
# probably choose to roll this together with the regular twit filtering,
# but I'm retentive here: these are messages of the type that "you elected
# to subscribe to our service, so we mail these notices out periodically".
# sort of spam really, but I don't want it in the spam database should I
# pump it through a process that might otherwise blacklist the submitter
# domain or email address...

:0h
FAILKEY=| ($FORMAIL -ISubject: -ITo: -ICc: -IResent-To: -IResent-Cc: | $MEGAGREP -i -f $NEUTLIST)

# If failkey is nonblank, we matched something.
:0
* ! $FAILKEY ?? ^^^^
{
        LOG="SPAM: Neutral spam - ads from mailing lists and such[$FAILKEY].$TWITVER"

        :0:
        |gzip -9fc>>$MAILDIR/twits.gz
}

# We wouldn't autoreply to these...

# ==========================================================================

# If a SPECIFIC ADDRESS from the twitlist appears anywhere within the
# headers, minus subject, and addressees, toss it.

# Because twit filtering occurs before lists, even those which are filtered
# before spam checking, this allows us to catch morons who post spam or
# other drivel to mailing lists.

#FAILKEY=| ($FORMAIL -ISubject: -ITo: -ICc: -IResent-To: -IResent-Cc: |$MEGAGREP -i -f $TWITLIST)


# for *MY* purposes, it works just fine to zot all messages containing
# references to these emails.  Remember, they're a select group -- by
# inclusion of To: and Cc: headers, if I'm on a list, and one of these
# twits writes an initial message, I don't see it (from), *AND* if people
# reply on-list with cc/to: the twit, I won't see them, *AND* if
# in-reference-to type headers are present on replies (even if not copied
# to the individual), I shouldn't see those.  I thereby avoid most, if not
# all of the veritable sh*tstorm twits usually generate.

:0h
FAILKEY=| ($FORMAIL -ISubject: | $MEGAGREP -i -f $TWITLIST)

# If failkey is nonblank, we matched something.
:0
* ! $FAILKEY ?? ^^^^
{
        LOG="SPAM: Match against twitlist [$FAILKEY].$TWITVER"

        :0:
        |gzip -9fc>>$MAILDIR/twits.gz
}

# The following recipe can be used to auto-reply a twit message
# to the sender (provided that it is a valid address).
#
# To enable this, add the 'c' flag to the preceding recipe (otherwise, this
# won't execute at all).  Arguably, if the preceding recipe sends the
# message to /dev/null then you can simply concatenate the action lines of
# this recipe with the above recipe, replacing the /dev/null action, and
# eliminating the need for a 'c' on that recipe.
#
# Note that I don't have this enabled - when used for spam.rc, spammers
# simply don't read this stuff (if their mail is even valid), and most of
# the twits I run into are on mailing lists, so why should I waste my
# bandwidth sending it (and probably getting a bounced-back return)?
#

:0 Aw
  | ( $FORMAIL -rt -I "Precedence: notification" -I "From: $MAILBOT" ;\
   cat $AUTOREPLY/twit.msg ) | $SENDMAIL -t

# ==========================================================================

# (twits.rc end)
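
Both rc files compute FAILKEY the same way: strip selected header fields, then grep what is left against a list file. The sketch below imitates that pipeline outside procmail so the matching can be tried by hand. It is an approximation under assumptions: awk stands in for "formail -ISubject:", plain grep stands in for $MEGAGREP, and -F (literal matching) is my choice rather than something the recipes above use. The name header_match is hypothetical.

```shell
#!/bin/sh
# Sketch: read a message header on stdin, drop the Subject: field
# (including folded continuation lines), and print every remaining
# header line that contains an entry from the list file.
# Usage: header_match /path/to/twits.dat < header.txt
header_match() {
    list=$1
    awk '
        # Start of the Subject field: skip it and remember we are in it.
        /^[Ss][Uu][Bb][Jj][Ee][Cc][Tt]:/ { skip = 1; next }
        # Folded continuation of a skipped field: skip that too.
        /^[ \t]/ && skip { next }
        # Any other line ends the skipped field and is passed through.
        { skip = 0; print }
    ' | grep -i -F -f "$list"
}
```

Empty output corresponds to a blank FAILKEY (no match); any output corresponds to a nonblank FAILKEY, and the printed lines show which header tripped the list.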

Another simplified application of greppage. I have about six or seven of these lists all in one RC, but each archiving to a different (gzipped) mailbox on the server, using a different datafile, and adding a different MB header (used to simplify the filtering in my PC mail client - then all it cares about is what the MB header is).


# Friends

#
:0
* $? $FORMAIL -xFrom: | $FGREP -i -f $PMDIR/friends.dat
{
        :0c:
        | $FORMAIL -b -A"X-my-MB: FRIENDS" >> $DEFAULT

        :0:
        |gzip -9fc>>$MAILDIR/friends.gz
}

The delivery mechanism is specific to how I do things - I don't store them into a mailbox on the server, but rather _archive_ them with GZIP, so gobs of email don't consume nearly as much disk space.
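
Appending with "gzip -9fc >> file" works because a gzip file may consist of several compressed members back to back, and decompressing it yields the members' contents concatenated in order. A small sketch of the round trip (the function names archive_msg and archive_read are made up for illustration):

```shell
#!/bin/sh
# Append whatever arrives on stdin to the archive as a new gzip member.
# Usage: archive_msg /path/to/twits.gz < message
archive_msg() {
    gzip -9fc >> "$1"
}

# Read the whole archive back as plain text, all members in order.
# Usage: archive_read /path/to/twits.gz
archive_read() {
    gzip -dc "$1"
}
```

Each message is compressed on its own, so the ratio is somewhat worse than compressing the whole mailbox in one pass, but nothing has to be decompressed at delivery time.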

I guess this demonstrates how long I've been using this sort of filter: with the centralized from/to/subject/etc snarfing, the match line really should be:


* $? echo $FROM | $FGREP -i -f $PMDIR/friends.dat

(unless someone sees a reason it shouldn't be this way)
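
For trying the friends lookup outside procmail, here is a minimal sketch, assuming $FGREP points at fgrep (equivalent to "grep -F") and that friends.dat holds one address per line. The function name is_friend is hypothetical.

```shell
#!/bin/sh
# Succeeds (exit 0) if the sender address contains any entry from the
# list file, compared literally and without regard to case.
# Usage: is_friend "a.b@example.com" /path/to/friends.dat
is_friend() {
    # -F treats list entries as fixed strings, so the dots in an
    # address are literal dots and not "any character" wildcards.
    printf '%s\n' "$1" | grep -q -i -F -f "$2"
}
```

With a plain regex grep, an entry of "a.b@example.com" would also match "aXb@example.com"; the fixed-string form avoids that.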

> 2. Do you use this also as a "plonk"?


You'll have to describe what it is you mean by "plonk".


---
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

 Sean B. Straw / Professional Software Engineering
 Post Box 2395 / San Rafael, CA  94912-2395

