procmail

Re: Garbage vs Valid

2003-02-01 12:33:36
"Tony L. Svanstrom" <tony(_at_)svanstrom(_dot_)com> wrote:

On Sat, 1 Feb 2003 the voices made fleet(_at_)teachout(_dot_)org write:

Is there any way to differentiate (in procmail) between
a random collection of letters/numbers and "valid"
words/acronyms/abbreviations?

 Not using only procmail, but there are several ways to find out
whether something is just garbage or a name/word: statistical methods,
or simply searching a dictionary.

I do some rudimentary checks in procmail.  They give me some success.

Here, for example, is one that looks for not-entirely-short From: addresses
that have no vowels or no consonants, which is, well, just weird.

 :0  # 021203 () sender's longish local address has no vowels or no consonants
  * $  $GO^0  ! LOCALPART ?? [$VOWELS]
  * $  $GO^0  ! LOCALPART ?? [$CONSONANTS]
  *             LOCALPART ?? [a-z]
  *             LOCALPART ?? .....
  { RX = "${RX:+$RX, }UBE.FR.!(VOWEL|CONSONANT)" }


You need to know that $LOCALPART is a private variable I've set that contains
the local part of the sender's putative address.  $GO is an "oversaturated"
supremum value; and $VOWELS and $CONSONANTS, well, should be obvious.
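
For anyone who wants to try the recipe, here is a minimal sketch of how those variables might be set up. None of this is from the post: the values of VOWELS, CONSONANTS and GO, and the helper variable SENDER, are assumptions for illustration, and the real definitions may well differ.

```
# Hypothetical setup; the post does not show how these are defined.
# "y" is listed in both classes so that names like "lynch" are not flagged.
VOWELS="aeiouy"
CONSONANTS="bcdfghjklmnpqrstvwxyz"

# Assumed value: 2147483647 is the maximum score described in procmailsc(5).
# The author calls $GO "oversaturated", so the real value may be larger.
GO=2147483647

# SENDER is a hypothetical helper: the reply address as computed by formail.
SENDER=`formail -rtzxTo:`

# Capture everything before the first "@" into $MATCH, then keep it.
:0
* SENDER ?? ^^\/[^@]+
{ LOCALPART = $MATCH }
```

With these in place, the two weighted conditions in the recipe each add $GO when the local part lacks a vowel or lacks a consonant, while the two plain conditions require at least one letter and at least five characters.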

This is a fairly low-hit recipe: it caught four of my last 100 spam
messages.  But while it obviously carries some risk of false positives,
it has been surprisingly stable in that regard.

I have some other recipes that are either more experimental or sufficiently
more complex (and ugly) that I don't feel like posting them -- but one
looks, for example, for strings of too many consonants and numbers at the end
of Subject-lines after extra space, and so on.  That one catches a lot!
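
To give a rough idea of what a check like that could look like, here is a guess at a much simpler version. It is not the unposted recipe: the thresholds (two or more spaces, then four or more consonants/digits running to the end of the line) and the RX tag name are assumptions.

```
# Hypothetical sketch, not the author's recipe.
# Matches a Subject: line that ends in two or more spaces followed by
# a run of at least four consonants or digits (no vowels, no "y").
:0
* ^Subject:.*[ ][ ]+[b-df-hj-np-tv-xz0-9][b-df-hj-np-tv-xz0-9][b-df-hj-np-tv-xz0-9][b-df-hj-np-tv-xz0-9]+$
{ RX = "${RX:+$RX, }UBE.SUBJ.TRAILING-JUNK" }
```

The character class is repeated because procmail's regular expressions have no {n} repetition counts. A real version would probably need tuning against legitimate subjects that end in ticket or order numbers.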

-- 
dman


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
