procmail

Re: Garbage vs Valid

2003-02-01 13:13:24
PSE-L(_at_)mail(_dot_)professional(_dot_)org (Professional Software 
Engineering) wrote:

At 13:11 2003-02-01 -0500, fleet(_at_)teachout(_dot_)org did say:

I suppose it's "life experience" (or something); but "tatanka"
appears to be "ok" (with minor reservations) whereas "pxhkieb" and
"muhirhw" are (to me) immediately suspect.  How does one describe
this in code?

You could probably use character classes to say that runs of
more than two consonants are suspect.  Look at the
distribution of vowels and consonants within English words and you
might note that pattern.
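
A rough sketch of that consonant-run idea, in Python rather than procmail so it is easy to try against sample names. The function name is made up for illustration, and treating "y" as a consonant is an assumption.

```python
import re

# Assumption: a "consonant" is any ASCII letter other than a, e, i, o, u.
# Counting "y" as a consonant is a judgment call, not something from the thread.
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxyz]{3,}", re.IGNORECASE)

def consonant_runs(name):
    """Return every run of more than two consonants found in name."""
    return CONSONANT_RUN.findall(name)

for sample in ("tatanka", "pxhkieb", "muhirhw", "apple"):
    print(sample, consonant_runs(sample))
```

Ordinary words such as "apple" trip this as well, so on its own it is only a starting point.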

Yes, exactly.  I tried yesterday, coincidentally, looking for
four-plus consonants in a row, but there were lots of false pozzes.
However,
I'm not through with the experiment.  An X is not likely to follow a
P.  An H is perhaps more likely to follow an X; but a K thereafter?
Nah.  Also, how many words end "HW", let alone with another consonant
prepended?
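
Those hunches about particular letter sequences could be written down as a small table of patterns. This is only a Python sketch; the labels and the exact patterns are guesses based on the examples in this paragraph, not anything measured against real hostnames.

```python
import re

# Hypothetical checks lifted from the guesses above.
UNLIKELY = {
    "px": re.compile(r"px"),                                  # X right after P
    "xhk": re.compile(r"xhk"),                                # K after XH
    "consonant+hw at end": re.compile(r"[bcdfgjklmnpqrstvwxz]hw$"),
}

def unlikely_hits(name):
    """Return the labels of every unlikely pattern that name contains."""
    lowered = name.lower()
    return [label for label, pattern in UNLIKELY.items() if pattern.search(lowered)]

for sample in ("tatanka", "pxhkieb", "muhirhw"):
    print(sample, unlikely_hits(sample))
```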

You'd need to develop a scoring system for unlikely combinations, methinks.
It would take some programming algorithms in the prep-work to make it
right.  Then you'd score various unlikely combos and add up what you
have at the end.

Examples picked out of my head on the fly of unlikely start combos,
where 10 is "unlikeliest" and anything below 5, say, is too common
to bother looking for:

        * 9^0  SOMEHOST ?? ^^[bfp][bcdgjkmpqvwxz]

"If it starts with a B or an F or a P, it'd better not be followed
by any of those others, or, uh, it's just creepy."

Get to the end of your list and calculate your score.
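
To make the add-it-up step concrete, here is a minimal Python sketch of the same scoring idea. Only the first rule comes from the recipe above; the other two rules, their weights, and the threshold are invented for illustration. Each rule adds its weight at most once, which is how I read the 9^0 form.

```python
import re

# (weight, pattern) pairs.  Only the first rule is from the recipe above;
# the rest are hypothetical examples with made-up weights.
RULES = [
    (9, re.compile(r"^[bfp][bcdgjkmpqvwxz]")),
    (8, re.compile(r"[bcdfgjklmnpqrstvwxz]hw$")),
    (7, re.compile(r"px")),
]

THRESHOLD = 8  # hypothetical cut-off, would need tuning on real mail

def score(name):
    """Sum the weights of all rules that match name (each counted once)."""
    lowered = name.lower()
    return sum(weight for weight, pattern in RULES if pattern.search(lowered))

def is_suspect(name):
    return score(name) >= THRESHOLD

for sample in ("tatanka", "pxhkieb", "muhirhw", "bftoemail"):
    print(sample, score(sample), is_suspect(sample))
```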

Frankly, this seems overblown as a procmail approach.  I do some
weird-ass checks, though, I'll admit.  But I do try for either
obvious or high-penetration, and preferably both.

I think it'd be a lot of work to achieve and would still give false
hits, both with non-English messages and with encoded
hostnames.

If you limit it to blatantly improbable names, it wouldn't have so
many false pozzes, I'll bet.  But I agree that it would be a lot
of work, and likely for little gain.

A scan I ran against some miscellaneous messages (NOT spam) revealed many
exceptions:

In my four-plus-consonants test of last night, one that quickly got hit
as a false-poz was earthlink.


Some exceptions:
         apple
         ultra
         sepulchre
         bftoemail               (used by bigfoot)
         uprrsmtp1               (gaak, smtp alone trips it, several other

Yes, I took S out of my tests after the first round, and PH (together)
out thereafter.  Still too ugly to code for in procmail.
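
For anyone who wants to repeat the experiment outside procmail, a Python sketch of the four-plus-consonants test with S dropped and PH exempted might look like this. Replacing "ph" with a placeholder that breaks a run is just one reading of "taking PH out", and counting "y" as a consonant is an assumption.

```python
import re

# Consonants with "s" removed ("y" kept as a consonant: assumption).
RUN_OF_FOUR = re.compile(r"[bcdfghjklmnpqrtvwxyz]{4,}")

def long_runs(name):
    """Return runs of four or more consonants, ignoring S and the pair PH."""
    # Assumption: "ph" is swapped for "_" so it cannot contribute to a run.
    cleaned = name.lower().replace("ph", "_")
    return RUN_OF_FOUR.findall(cleaned)

for sample in ("earthlink", "sepulchre", "uprrsmtp1", "bftoemail", "pxhkieb"):
    print(sample, long_runs(sample))
```

Even with those two exemptions, "earthlink" and "sepulchre" still trip the test.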

-- 
dman


