procmail

Re: Keep getting stubborn spam with random words

2004-03-09 21:32:11

(Sorry this is getting pretty far afield for this list.  I'm just trying to
follow my thoughts to a logical conclusion.  If you want to read more about
this, the Spambayes website <http://www.spambayes.org/> or the archives of
the spambayes or spambayes-dev mailing lists are good places to look for
more detail.)

    >> In the Spambayes project this stuff is called "word salad".  (I doubt
    >> the term originated with us.)  The one conclusion we've reached so
    >> far about it is that it generally doesn't bother the accuracy of our
    >> classifier.

    Jay> I guess that its effect on your classifier would depend upon
    Jay> several variables, but unless you have a way to exclude them from
    Jay> being classified they will affect the performance of your
    Jay> classifier. The effect may be subtle if your classifier already has
    Jay> a large corpus.

Essentially none of these nonsense words (or the real words which seem to be
chosen at random) affect the scoring at all, because they don't exist in
most training databases.  By default, Spambayes ignores any words with spam
probabilities between 0.4 and 0.6.  Words(*) which don't already exist in
the database are assigned a spamprob of 0.5, so they are ignored by the
classifier.  I don't know how other classifiers treat words with spamprobs
near 0.5, but if they don't ignore them then I agree word salad could be a
problem.
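
To make the thresholding concrete, here is a minimal sketch of the idea.  It
is not Spambayes source code: the names UNKNOWN_SPAMPROB, MIN_STRENGTH and
significant_tokens, and the toy database, are assumptions chosen only for
illustration.

```python
# Sketch only: names and values are illustrative assumptions, not Spambayes code.
UNKNOWN_SPAMPROB = 0.5   # spamprob given to tokens absent from the database
MIN_STRENGTH = 0.1       # tokens closer than this to 0.5 are ignored


def significant_tokens(tokens, database):
    """Return (token, spamprob) pairs far enough from 0.5 to influence scoring."""
    kept = []
    for token in set(tokens):
        prob = database.get(token, UNKNOWN_SPAMPROB)
        if abs(prob - 0.5) >= MIN_STRENGTH:
            kept.append((token, prob))
    return sorted(kept)


# Hypothetical training database mapping token -> spamprob.
database = {"viagra": 0.99, "python": 0.02, "the": 0.5}
# "xqzvy" and "plinth" stand in for word salad never seen in training.
message = ["viagra", "xqzvy", "plinth", "the", "python"]
print(significant_tokens(message, database))
```

Only "python" and "viagra" survive the filter here; the two unseen
word-salad tokens fall back to 0.5 and are dropped along with "the".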

    Jay> Beyond that - why do you think someone would bother to send these
    Jay> things?

Perhaps to fool systems which use checksum techniques to identify duplicate
messages, or systems which assign greater spam probability to very short
messages.
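
As a rough illustration of the checksum case, the sketch below assumes a
filter that compares exact digests of the message body.  SHA-256 and the
checksum helper are assumptions made for the example; real duplicate
detectors may well use something fuzzier.

```python
import hashlib


def checksum(body):
    """Hex digest of the exact message text (assumed exact-match scheme)."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


pitch = "Buy cheap pills now!"
# Same pitch, different random words appended to each copy.
copy_a = pitch + "\n\nplinth walrus ordinal"
copy_b = pitch + "\n\ngantry mauve tuber"

print(checksum(copy_a) == checksum(copy_b))  # the two copies no longer match
print(checksum(copy_a) == checksum(copy_a))  # an unmodified copy still matches
```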

Short messages seem to be the best way to move scoring into the unsure range
(not conclusively spam or ham) because the tokenizer just doesn't generate a
lot of tokens for the classifier.  There is a fairly well-understood way to
fool a Bayesian classifier though.  If you can associate a paragraph or two
of meaningful text with an email address (Google for
"skip(_at_)pobox(_dot_)com python" to find examples of what I might find
hammy) and tack that paragraph onto a
short spam message you'd have a good chance of overwhelming the spammy words
with a bunch of hammy words and thus fool the classifier.  Fortunately for
us that's an expensive proposition for the spammer.  Anything static the
spammer chose would eventually get seen in enough trained spam that those
words which used to be hammy would now be unsure (seen in about the same
number of scored spam and ham) or spammy (seen in significantly more spam
than ham) and thus not help the spammer's cause.  That means that for each
spam sent to skip(_at_)pobox(_dot_)com he'd have to come up with a new
paragraph full of hopefully hammy words.  *That* gets expensive and kills
his bottom line, so thus far we haven't seen much of that sort of thing.  In
addition, I
suspect googling for most email addresses wouldn't yield anything useful.
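
The drift from hammy to unsure to spammy can be sketched with a deliberately
simplified per-token ratio.  This is not the calculation Spambayes performs
(the real one is more involved); the spamprob function and all the counts
below are made-up assumptions for illustration.

```python
def spamprob(spam_count, ham_count, nspam, nham):
    """Simplified per-token spam probability from training counts."""
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    if spam_ratio + ham_ratio == 0:
        return 0.5  # never seen in training: neutral
    return spam_ratio / (spam_ratio + ham_ratio)


nspam = nham = 1000  # hypothetical numbers of trained spam and ham messages

# A token from the borrowed paragraph: seen in 40 ham and no spam so far.
hammy = spamprob(0, 40, nspam, nham)
# After 40 spams carrying the same paragraph have been trained as spam.
unsure = spamprob(40, 40, nspam + 40, nham)
# After 400 such spams have been trained as spam.
spammy = spamprob(400, 40, nspam + 400, nham)

print(hammy, unsure, spammy)
```

With these made-up counts the token starts out strongly hammy, lands near
0.5 once it has shown up in about as many spam as ham, and ends up well
above 0.6 after that, so a static paragraph stops helping the spammer.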

Skip

(*) We normally use the term "token" instead of "word" to be more accurate,
since our tokenizer synthesizes lots of "structural" tokens which don't
correspond one-to-one to actual words in the message.

_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail