procmail

Re: syntax for searching message body for adult words...

2002-01-28 17:17:40
At 15:11 2002-01-25 -0500, John Corniel wrote:
> I have looked around but nothing.

Look harder.

> I tried the following:
>
> :0 B
> * sex
> AdultContentFolder

Aside from the fact that you don't specify a lock (trailing ':' on the flags line), what about this doesn't work? Have you run it in a sandbox with verbose logging (see my .sig)?
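
A sandbox run along those lines might look like the sketch below. The file name sandbox.rc, the scratch directory $HOME/procmail-test and the saved message test.msg are made-up placeholders, and the invocation shown in the comment is just one common way to run a test rcfile by hand.

```
# sandbox.rc (hypothetical name); run by hand against a saved message:
#   procmail ./sandbox.rc < test.msg
# Everything lands under a scratch directory, so real mail is untouched.

# assumed scratch directory; create it before running
MAILDIR=$HOME/procmail-test
DEFAULT=$MAILDIR/default
LOGFILE=$MAILDIR/log
# log each condition as it is evaluated
VERBOSE=yes

:0 B:
* sex
AdultContentFolder
```

After the run, the file named in LOGFILE should show whether the condition matched and where the message was delivered.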

> Please let me know the best way to go about this.

There are a lot of ways. One would be to use a single condition that ORs together a bunch of individual keywords:

:0B:
* (sex|porn|teen|gay|farm\ animals)
AdultContentFolder

Or using grep (as already mentioned) in conjunction with a file containing words:

:0B:
* ? fgrep -i -f somewordfile
AdultContentFolder

The grep operation, depending upon the size of your wordlist, the size of the message itself, and whether you choose to use the -w (word breaks) option, can suck up a lot of memory to run. That's not the fault of procmail: grep is a totally separate system utility, and the pattern matcher there can get piggy.
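
For illustration, the word file is simply one fixed string per line. A minimal sketch with -w added, assuming a hypothetical file $HOME/.adultwords and a grep that supports -w:

```
# $HOME/.adultwords (hypothetical path), one fixed string per line:
#   porn
#   xxx

:0 B:
* ? fgrep -i -w -f $HOME/.adultwords
AdultContentFolder
```

With -w, "xxx" would only match as a whole word, which cuts down on hits inside longer strings at the cost of more work for grep.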

Note also that in both cases you're very likely to catch messages such as those in this thread, or any message that merely mentions one of the words, such as someone discussing the latest teen angst movie. That'd be true of probably anything one might come up with for a filter while discussing the filter itself. You might try using scoring (see 'man procmailsc' for more information on scoring), in which case you can assign weights to different words or phrases:

(please forgive the crudeness of some of the text which follows - and this is merely an excerpt!)

:0B:
* -75^0
* 30^1 (hardcore)
* 45^1 (\ sex|barely\ legal|\ xxx\ |adult|teen|porn|gay|erotic|orgy)
* 45^1 (lingerie|live\ (streaming\ |)(video|chat)|picture|gallery)
* 100^1 (viagra|sperm|\<jiz|\<jism|\<cum\>|orgasm|tits|lesbian)
AdultContentFolder

(FTR, this is a small excerpt of things I actually check against the subject, and I just set the weights here somewhat arbitrarily, so the scoring is a bit skewed, both for my own mail and for the value of such things in a subject versus finding them in the body. In fact, you'll probably want to use an exponent >1.)
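
To make the arithmetic concrete, here is how the recipe above would score two made-up message bodies, going by the rules in procmailsc. The bodies are hypothetical; the weights are the ones from the recipe.

```
# Per procmailsc: a condition "w^x regexp" adds w for the first match,
# w*x for the second, w*x*x for the third, and so on. An empty condition
# such as "-75^0" always matches once, so it just sets a starting score.
# The recipe delivers only if the final total is greater than zero.
#
# Hypothetical body with "hardcore" once and "teen" once:
#   -75 + 30 + 45 = 0          not positive, so the message is NOT filed
#
# Same body, but with "teen" appearing twice (x is 1, so each match adds 45):
#   -75 + 30 + 45 + 45 = 45    positive, so the message IS filed
```

In other words, the -75 starting score acts as a threshold: a single mid-weight keyword isn't enough on its own, but a few of them together (or one of the 100-point words) will tip the message into the folder.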

Examine the sort of spew you get and add/remove keywords and weighting accordingly. I don't classify porno spam separately from regular spam, and I choose not to subscribe to any adult jokelists, so my filters can get pretty harsh without risking too much valid email (and the few lists which might use such language are fortunately closed lists, and thus can be filtered before spam checks because they're not subject to outside submission).

> This email message is for the sole use of the intended
> recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or
> distribution is prohibited.  If you are not the intended
> recipient, please contact the sender by reply email and
> destroy all copies of the original message

Oops. Okay, I've destroyed my copy. You'll have to contend with the copies of the message which have been archived on various publicly accessible search archives.

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.
