Jay> But again, including these "word salad"/"zombie"/"hammy" inputs in
Jay> the training corpus for a Bayesian classifier will have an effect
Jay> on how it performs. It would be snake oil to claim otherwise.
Sure, if you train on such messages as either ham or spam it will affect
things. (Note that I don't train on every message I receive, so not all
word salad makes it into my training database. Most of us Spambayes
developers train on roughly the false negatives, false positives and unsure
messages.) Most of the time such words are things I've never seen or trained
on before, so they don't register. Training on one message containing such
nonsense words will affect scoring of future messages which contain them,
but since they are more-or-less random strings or words, they are unlikely
to be seen again unless they happen to be part of
my normal email language. Here's the beginning of a dump from my training
database for tokens containing strings of six to twelve consonants:
% spamcounts -d tte.db -r '[bcdfghjklmnpqrstvwxz]{6,12}' | head
db: tte.db
token       nspam  nham  spam prob
kfsdqvhom       1     0  0.844827586207
ugtsfmqj        1     0  0.844827586207
ivtcgkqab       1     0  0.844827586207
jkfrvsbx.       1     0  0.844827586207
ptcvplr         1     0  0.844827586207
mbvzpwrj        1     0  0.844827586207
dspkfnwd,       1     0  0.844827586207
Note that each of them has appeared only once. Should they ever be seen
again (I have no reason to think they will) they will make such messages
look more spammy than if they had never been seen.
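
For the curious, the 0.844827586207 figure can be reproduced with a Robinson-style adjustment of the raw token probability. The sketch below is an illustration, not the Spambayes source: it assumes an unknown-word strength of 0.45 and an unknown-word prior of 0.5 (neither value appears in the dump above), and the function name and the message totals are hypothetical.

```python
# Assumed settings; they do not appear in the spamcounts output above.
UNKNOWN_WORD_STRENGTH = 0.45  # "s": how much weight the prior gets
UNKNOWN_WORD_PROB = 0.5       # "x": prior for a token never trained on


def token_spam_prob(nspam_tok, nham_tok, nspam_total, nham_total,
                    s=UNKNOWN_WORD_STRENGTH, x=UNKNOWN_WORD_PROB):
    """Robinson-style adjusted spam probability for a single token."""
    n = nspam_tok + nham_tok
    if n == 0:
        return x  # never seen in training: fall back to the prior
    # Fraction of trained spam (ham) messages that contained the token.
    spam_ratio = nspam_tok / max(nspam_total, 1)
    ham_ratio = nham_tok / max(nham_total, 1)
    raw = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate toward the prior; the pull weakens as n grows.
    return (s * x + n * raw) / (s + n)


# The totals here are hypothetical. For a token seen only in spam the raw
# estimate is 1.0 whatever the totals are, so they do not affect the result.
print(token_spam_prob(1, 0, nspam_total=100, nham_total=100))
```

Under those assumptions a token seen once in spam and never in ham comes out at about 0.8448, matching the dump. A token like 'calls' (6 spam, 10 ham) cannot be reproduced this way without the total spam and ham message counts in the database, which the output does not show.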
Here's a little word salad plucked from my current spam folder:
calls drypis aureate virile
abate spliced pedigree wifely
brahm betise bejeweled
I suspect they are all valid English words though I don't recognize some of
them. If I ask my database which ones it knows about it responds:
% spamcounts -d tte.db -r '^(calls|drypis|aureate|virile|abate|spliced|pedigree|wifely|brahm|betise|bejeweled)$'
db: tte.db
token   nspam  nham  spam prob
wifely      1     0  0.844827586207
calls       6    10  0.354734930723
virile      1     0  0.844827586207
So out of eleven words, three were recognized. Only one was mildly hammy.
Still, that message scored as spam (overall spam probability of 0.93 on a
scale of 0.0 to 1.0). If you look at the individual tokens it used to
compute the final weighted spam probability for the message:
X-Spambayes-Evidence: '*H*': 0.00; '*S*': 0.85; 'grab': 0.09;
'notation': 0.09; 'sticking': 0.16; '(this': 0.23; 'copies': 0.24;
'subject:our': 0.25; 'subject:when': 0.25; 'door': 0.28;
'code': 0.28; 'calls': 0.30; 'cds': 0.31; 'kind': 0.33;
'code.': 0.34; 'windows': 0.35; 'header:Received:9': 0.37;
'manual': 0.37; 'works': 0.37; 'charset:us-ascii': 0.38;
'corporate': 0.38; 'received:208.58': 0.38;
'received:208.58.1': 0.38; 'received:208': 0.39; 'because': 0.61;
'super': 0.61; 'ware': 0.61; 'delivered': 0.61;
'received:manatee.mojam.com': 0.61; 'received:mojam.com': 0.61;
'received:24': 0.61; 'absolute': 0.62; 'pro': 0.62; 'our': 0.64;
'yourself': 0.64; 'header:Reply-To:1': 0.64; 'countries': 0.66;
'registration': 0.66; 'ship': 0.66; 'receive': 0.66;
'to:addr:musi-cal.com': 0.66; 'content-type:text/html': 0.67;
'sell': 0.69; 'oem': 0.69; 'unique': 0.71; 'cost': 0.72;
'price': 0.73; 'subject:software': 0.75; 'virus': 0.77;
'low': 0.80; 'soft': 0.81; '15.00': 0.84; 'airmail': 0.84;
'collet': 0.84; 'parasite': 0.84; 'pr0': 0.84;
'received:dyn.optonline.net': 0.84;
'received:optonline.net': 0.84; 'virile': 0.84; 'wifely': 0.84;
'url:php': 0.85; 'deluxe': 0.91; 'to:addr:itineraries': 0.92;
'url:re': 0.93; 'printing': 0.95; 'url:biz': 0.99
you'll see that these were only three of more than 50 features the tokenizer
extracted from the message and that the classifier used to compute the
probability.
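
To make the combining step concrete, here is a minimal sketch of chi-squared (Fisher-style) combining, which is my understanding of how the '*H*' and '*S*' values in the evidence header are derived and then folded into one score. The function names are hypothetical, and the code is simplified (no underflow protection for very long token lists), so treat it as an illustration rather than the Spambayes implementation.

```python
import math


def chi2q(x2, dof):
    """Probability that a chi-squared variable with an even number of
    degrees of freedom (dof) is at least x2."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1


def combine(probs):
    """Return (H, S, score) for a list of per-token spam probabilities.

    Each probability must lie strictly between 0 and 1.
    """
    n = len(probs)
    if n == 0:
        return 0.0, 0.0, 0.5
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return h, s, (s - h + 1.0) / 2.0


# Final step only, using the rounded H and S values from the header above.
h, s = 0.00, 0.85
print((s - h + 1.0) / 2.0)  # about 0.925
```

With the rounded header values the last step gives about 0.925, which is in line with the reported 0.93 given that H and S are themselves rounded to two places. Since every token contributes one term to each sum, two or three salad words among 50-odd features move the result only a little.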
Does word salad have an effect? Probably, but for Spambayes at least its
effect is at best modest.
--
Skip Montanaro
Got gigs? http://www.musi-cal.com/submit.html
Got spam? http://spambayes.sf.net/
skip(_at_)pobox(_dot_)com