
RE: [Asrg] Re: 2a. Analysis - Spam filled with words

2003-09-10 15:38:43


-----Original Message-----
From: Terry Sullivan [mailto:terry(_at_)pantos(_dot_)org] 
Sent: Wednesday, September 10, 2003 12:51 PM
To: asrg(_at_)ietf(_dot_)org
Subject: [Asrg] Re: 2a. Analysis - Spam filled with words


On Tue, 9 Sep 2003 13:45:51 -0400,
"Hector Santos" <winserver(_dot_)support(_at_)winserver(_dot_)com>

These are called "tag injections."  It's been around for a
while in HTML email.

Without meaning to seem disagreeable, I must disagree.  Hector's 
point that so-called tag injections/obfuscating comments have been 
around a while is well taken.  And he's quite correct that messages 
where the text is broken up or otherwise obfuscated are indeed 
intended to bypass simple keyword filters.

But in these "new" emails, exactly the opposite is true.  The text is 
*not* broken up; on the contrary, it's perfectly intact, but "hidden" 
from *human* readers.  

The thing that makes these "new" messages different is precisely the 
fact that they do *not* contain the nonsense words/random characters 
typical of obfuscating comments.  Instead, they contain literally 
dozens of "high-end" *content-rich* words, deliberately left intact.  
That's the "tell" (a poker term) that these messages are probably 
designed to confuse statistical language classifiers.  (Again, they 
don't work, won't work--and ultimately *can't* work, for reasons that 
are interesting only to people like me.)  Admittedly, this was based on 
a manual "training" run, but the Bayesian component of my statistical 
filter started "catching" these after seeing just two of them.


I don't know that we can simply say that they don't work. For example,
assume that the spammer knows which words are non-spammy and inserts those
into the message. This can potentially have a significant effect on the
statistical scoring. Of course, filters are combating this by identifying
the invisible text and ignoring it in the calculation. However, in the
absence of such intelligence, or when those words are placed plainly in the
text of the message, they have the ability to compensate for the spammy words.
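
To make that concrete, here is a minimal sketch of the dilution effect. It
is not any particular filter's implementation, and the token probabilities
and word lists are made-up illustrative values; it just shows how appending
innocuous words pulls a naive Bayes score downward:

import math

# Hypothetical P(token | spam) and P(token | ham) values, chosen only
# to illustrate the effect; a real filter would learn these from a corpus.
TOKEN_PROBS = {
    "viagra":  (0.90, 0.01),
    "free":    (0.60, 0.10),
    "offer":   (0.55, 0.15),
    "meeting": (0.05, 0.40),
    "project": (0.04, 0.35),
    "thanks":  (0.06, 0.45),
}

def spam_score(tokens, prior_spam=0.5):
    """P(spam | tokens) under a naive Bayes model with the table above."""
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for t in tokens:
        p_spam, p_ham = TOKEN_PROBS.get(t, (0.5, 0.5))  # unknown words are neutral
        log_spam += math.log(p_spam)
        log_ham += math.log(p_ham)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

spam_words = ["viagra", "free", "offer"]
padded = spam_words + ["meeting", "project", "thanks"]  # injected "hammy" words

print("spam words only:        %.3f" % spam_score(spam_words))
print("padded with ham words:  %.3f" % spam_score(padded))

With these made-up numbers, the score drops from near 1.0 down to roughly
0.8 once the hammy words are appended, which is exactly the compensation
effect described above.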

Beyond that, spammers continue to shift their vocabulary closer to the
vocabulary of legitimate mail, based on: a) the output of statistical
filters, b) spam that attempts to resemble personal notes, or c) spam that
attempts to resemble business dialogue. One question is how long we will
have the benefit of such a large distinction between the content of spam
and non-spam. At what point along the convergence of these vocabularies do
the effectiveness and accuracy of Bayesian filters become affected?

There are two different exercises here:
1. A measurement study of the vocabulary space of actual spam mail and
non-spam mail, and of how these spaces change over time (a rough sketch
follows below).
2. An analysis of how the effectiveness and accuracy of Bayesian filters
would be affected given certain measures of distinction between the two
vocabularies. There is probably some existing work here that gets us close
to an answer.
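
As a rough sketch of exercise 1, the distinction between the two
vocabularies could be quantified with something like the Jensen-Shannon
divergence between spam and non-spam word distributions, then tracked over
time. The corpora below are toy examples and the choice of measure is only
an assumption, but the shape of the exercise would look something like this:

import math
from collections import Counter

def unigram_dist(messages):
    """Word-frequency distribution over a list of message strings."""
    counts = Counter(word for msg in messages for word in msg.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two word distributions:
    0.0 means identical vocabularies, log(2) means completely disjoint."""
    vocab = set(p) | set(q)
    mid = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl(a):
        return sum(a[w] * math.log(a[w] / mid[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

# Toy corpora; in practice these would be dated spam/ham archives,
# bucketed by month, so the divergence can be plotted over time.
spam = ["free offer buy now", "amazing free offer click now"]
ham = ["meeting moved to thursday", "thanks for the project update"]

print("JS divergence: %.3f" % js_divergence(unigram_dist(spam), unigram_dist(ham)))

Computed month by month over dated archives, a downward trend in that
number would be direct evidence of the vocabulary convergence described
above, and would feed straight into exercise 2.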

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg