procmail

Re: Keep getting stubborn spam with random words

2004-03-10 10:42:14

    Jay> But again, including these "word salad"/"zombie"/"hammy" inputs in
    Jay> the training corpus for a Bayesian classifier will have an effect
    Jay> on how it performs. It would be snake oil to claim otherwise.

Sure, if you train on such messages as either ham or spam it will affect
things.  (Note that I don't train on every message I receive, so not all
word salad makes it into my training database.  Most of us who are Spambayes
developers train on something which approximates false negatives, false
positives and unsure messages.)  Most of the time such words are things I've
never seen or trained on before, so they don't register.  Training on one
message containing such nonsense words will affect scoring of future
messages which contain them, but since they are more-or-less random strings
or words they are unlikely to be seen again unless they happen to be part of
my normal email language.  Here's the beginning of a dump from my training
database for tokens containing strings of six to twelve consonants:

    % spamcounts -d tte.db -r '[bcdfghjklmnpqrstvwxz]{6,12}' | head
    db: tte.db
    token       nspam   nham    spam prob
    kfsdqvhom   1       0       0.844827586207
    ugtsfmqj    1       0       0.844827586207
    ivtcgkqab   1       0       0.844827586207
    jkfrvsbx.   1       0       0.844827586207
    ptcvplr     1       0       0.844827586207
    mbvzpwrj    1       0       0.844827586207
    dspkfnwd,   1       0       0.844827586207

Note that each of them has appeared only once.  Should they ever be seen
again (I have no reason to think they will) they will make such messages
look more spammy than if they had never been seen.
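
The 0.844827586207 shown for every once-seen token can be reproduced with a small sketch. It assumes the Robinson-style adjustment that Spambayes applies to raw token probabilities, with what are believed to be the default settings (unknown_word_strength = 0.45, unknown_word_prob = 0.5); the post does not state those settings, so treat them as assumptions. The function name and the corpus totals below are hypothetical.

```python
def token_spamprob(nspam_tok, nham_tok, nspam_total, nham_total,
                   s=0.45, x=0.5):
    """Adjusted spam probability for one token (illustrative sketch).

    s and x are assumed defaults: the weight given to the prior and the
    prior probability assigned to a never-seen token.
    """
    # Fraction of spam (resp. ham) training messages containing the token.
    spam_ratio = nspam_tok / nspam_total
    ham_ratio = nham_tok / nham_total
    # Raw probability from counts alone; 1.0 when the token was only in spam.
    p = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate toward x in proportion to how little
    # evidence (n sightings) there is.
    n = nspam_tok + nham_tok
    return (s * x + n * p) / (s + n)


# One spam sighting, no ham sightings. The totals are made up; with
# nham_tok == 0 they cancel out of the result.
prob = token_spamprob(1, 0, nspam_total=1000, nham_total=1000)
print(round(prob, 12))  # 0.844827586207
```

For a token such as 'calls', seen in both ham and spam, the result also depends on the total number of ham and spam messages trained, which the post does not give.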

Here's a little word salad plucked from my current spam folder:

    calls drypis aureate virile 
    abate spliced pedigree wifely 
    brahm betise bejeweled 

I suspect they are all valid English words though I don't recognize some of
them.  If I ask my database which ones it knows about it responds:

    % spamcounts -d tte.db -r '^(calls|drypis|aureate|virile|abate|spliced|pedigree|wifely|brahm|betise|bejeweled)$'
    db: tte.db
    token       nspam   nham    spam prob
    wifely      1       0       0.844827586207
    calls       6       10      0.354734930723
    virile      1       0       0.844827586207
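
What the two regular-expression queries select can be sketched against a toy token table. The dictionary below is a hypothetical stand-in for tte.db holding a few rows copied from the dumps above. It is not how spamcounts is implemented, and whether spamcounts searches or fully matches each token is an assumption here (for the anchored pattern the result is the same either way).

```python
import re

# Toy stand-in for the training database: token -> (nspam, nham).
db = {
    "wifely": (1, 0),
    "calls": (6, 10),
    "virile": (1, 0),
    "kfsdqvhom": (1, 0),
    "ptcvplr": (1, 0),
}

words = ("calls drypis aureate virile abate spliced pedigree "
         "wifely brahm betise bejeweled").split()

# Anchored alternation: a token is selected only if it is exactly one
# of the listed words, not merely a string containing one.
exact = re.compile("^(" + "|".join(map(re.escape, words)) + ")$")
known = sorted(tok for tok in db if exact.search(tok))

# Unanchored run of six to twelve consonants anywhere in the token.
consonants = re.compile(r"[bcdfghjklmnpqrstvwxz]{6,12}")
salad = sorted(tok for tok in db if consonants.search(tok))

print(known)
print(salad)
```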

So out of eleven words, three were recognized.  Only one was mildly hammy.
Still that message scored as spam (overall spam probability of 0.93 on a
scale of 0.0 to 1.0).  If you look at the individual tokens it used to
compute the final weighted spam probability for the message:

    X-Spambayes-Evidence: '*H*': 0.00; '*S*': 0.85; 'grab': 0.09;
            'notation': 0.09;
            'sticking': 0.16; '(this': 0.23; 'copies': 0.24;
            'subject:our': 0.25; 'subject:when': 0.25; 'door': 0.28;
            'code': 0.28; 'calls': 0.30; 'cds': 0.31; 'kind': 0.33;
            'code.': 0.34; 'windows': 0.35; 'header:Received:9': 0.37;
            'manual': 0.37; 'works': 0.37; 'charset:us-ascii': 0.38;
            'corporate': 0.38; 'received:208.58': 0.38;
            'received:208.58.1': 0.38; 'received:208': 0.39; 'because': 0.61;
            'super': 0.61; 'ware': 0.61; 'delivered': 0.61;
            'received:manatee.mojam.com': 0.61; 'received:mojam.com': 0.61;
            'received:24': 0.61; 'absolute': 0.62; 'pro': 0.62; 'our': 0.64;
            'yourself': 0.64; 'header:Reply-To:1': 0.64; 'countries': 0.66;
            'registration': 0.66; 'ship': 0.66; 'receive': 0.66;
            'to:addr:musi-cal.com': 0.66; 'content-type:text/html': 0.67;
            'sell': 0.69; 'oem': 0.69; 'unique': 0.71; 'cost': 0.72;
            'price': 0.73; 'subject:software': 0.75; 'virus': 0.77;
            'low': 0.80; 'soft': 0.81; '15.00': 0.84; 'airmail': 0.84;
            'collet': 0.84; 'parasite': 0.84; 'pr0': 0.84;
            'received:dyn.optonline.net': 0.84;
            'received:optonline.net': 0.84; 'virile': 0.84; 'wifely': 0.84;
            'url:php': 0.85; 'deluxe': 0.91; 'to:addr:itineraries': 0.92;
            'url:re': 0.93; 'printing': 0.95; 'url:biz': 0.99

you'll see that these were only three of more than 50 features the tokenizer
extracted from the message and that the classifier used to compute the
probability.
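
How the per-token numbers relate to the overall score can be sketched as follows, assuming the chi-squared (Fisher) combining scheme Spambayes is understood to use. The function names are hypothetical and this is a simplified illustration, not the project's code: it does no underflow protection and expects every probability to lie strictly between 0 and 1. The '*S*' and '*H*' entries in the header are the two intermediate indicators, and the final score is (S - H + 1) / 2.

```python
import math


def chi2q(x2, v):
    """Survival function of chi-squared with an even number v of
    degrees of freedom (closed-form series)."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def final_score(S, H):
    """Combine the spam and ham indicators into one 0.0-1.0 score."""
    return (S - H + 1.0) / 2.0


def combine(probs):
    """Fisher-style combination of per-token spam probabilities."""
    n = len(probs)
    # S is near 1 when the tokens are collectively spammy,
    # H is near 1 when they are collectively hammy.
    S = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return final_score(S, H)


# The header above reports '*H*': 0.00 and '*S*': 0.85, which rounds
# to the 0.93 quoted for the message.
print(final_score(0.85, 0.00))
```

Because every token feeds the same two sums, a pair of once-seen words at 0.84 shifts the result only slightly when some fifty other clues are present.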

Does word salad have an effect?  Probably, but for Spambayes at least its
effect is at best modest.

-- 
Skip Montanaro
Got gigs? http://www.musi-cal.com/submit.html
Got spam? http://spambayes.sf.net/
skip(_at_)pobox(_dot_)com

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail