Stupid perl Tricks

Hey all,

Sunday I wrote a perl script to scan and compare a corpus (corpi? dang, Ms.
Ingle would KILL me... probably corpses or something) of spam and good
mail. I wanted to answer some questions, specifically:


I'm using PerlJacket (duh, I wrote it), so is there anything special about
headers?

What does the distribution of tokens in headers, spam vs nonspam, look like?

Is it profitable to think about aggregate tokens, which is to say combining
tokens?


Output is a correlated list, which I then easily imported into Access (of
all things) and sliced and diced further.


Anyway, the results were interesting and not at all what I expected. But
they have allowed me to tune my autoresponder so that I have generated *no*
autoresponses to an obvious spam in the last 48 hours.


I ran the script on corpses of over 300 spams (specifically the spams which
got through my extant filtering) and 6000 good mails. It took about 4 minutes
on a K6 500 running SuSE Linux, perl 5.005somethingorother; maximum working
set was around 50MB. I got about 250,000 records. Filtered down to the
records with at least a 0.05 ratio of occurrence in either good or spam,
there were fewer than 300. Winnowed down to the ones which were highly
correlative, about 20 were left.
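
For anyone who wants to try the same kind of count, here is a minimal sketch
of the filtering step described above. It is not the actual script. The sub
names (header_tokens, occurrence_ratios, correlate), the "headername:word"
token format, and the tab-separated output are all assumptions made for
illustration; "ratio of occurrence" is taken to mean the fraction of messages
in a corpus that contain the token at least once.

```perl
#!/usr/bin/perl
use strict;

# Return the distinct header tokens of one message.
# Assumption: a token is a whitespace-separated word from a header value,
# prefixed with the lowercased header name, e.g. "subject:FREE".
sub header_tokens {
    my ($message) = @_;
    my ($headers) = split /\n\n/, $message, 2;   # headers end at the first blank line
    $headers = '' unless defined $headers;
    $headers =~ s/\n[ \t]+/ /g;                  # unfold continuation lines
    my %seen;
    for my $line (split /\n/, $headers) {
        my ($name, $value) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
        $seen{ lc($name) . ':' . $_ } = 1 for split ' ', $value;
    }
    return keys %seen;
}

# For one corpus, map each token to the fraction of messages containing it.
sub occurrence_ratios {
    my (@messages) = @_;
    my %count;
    for my $message (@messages) {
        $count{$_}++ for header_tokens($message);
    }
    my $n = @messages || 1;                      # avoid dividing by zero
    my %ratio;
    $ratio{$_} = $count{$_} / $n for keys %count;
    return \%ratio;
}

# Keep only tokens whose ratio reaches the threshold in either corpus.
# Each row is [ token, spam ratio, good ratio ], sorted by token.
sub correlate {
    my ($spam, $good, $threshold) = @_;
    my %all = (%$spam, %$good);
    my @rows;
    for my $token (sort keys %all) {
        my $s = $spam->{$token} || 0;
        my $g = $good->{$token} || 0;
        next unless $s >= $threshold || $g >= $threshold;
        push @rows, [ $token, $s, $g ];
    }
    return @rows;
}

# Tiny made-up corpora, only to show the calling convention.
my @spam = ("Subject: FREE money\nX-Mailer: BulkBlaster\n\nbody",
            "Subject: FREE offer\n\nbody");
my @good = ("Subject: procmail recipe\nX-Mailer: Pine\n\nbody");

my @rows = correlate(occurrence_ratios(@spam), occurrence_ratios(@good), 0.05);
printf "%s\t%.3f\t%.3f\n", @$_ for @rows;        # token, spam ratio, good ratio
```

The tab-separated lines should import into a database or spreadsheet without
much fuss. With real corpora you would read the messages from files instead
of inline strings, and the 0.05 threshold is just the figure quoted above.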


Should I post it (the script, not the results; do what thou wilt)? I will,
if asked. But the spy-vs-spy aspect of the spamming traveler makes me
reluctant; they send me their works, try them out on me or something.
There's got to be a better way.

--

Fred Morris
m3047(_at_)inwa(_dot_)net


BWAHAHAHA! SPAMMER, YOU HATE YOURSELF! DIE SPAMMER DIE!!



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail