
RE: [Asrg] Re: 2a. Analysis - Spam filled with words

2003-09-11 06:14:55

> The thing that makes these "new" messages different is precisely the 
> fact that they do *not* contain the nonsense words/random characters 
> typical of obfuscating comments.  Instead, they contain literally 
> dozens of "high-end" *content-rich* words, deliberately left intact. 
>  
> That's the "tell" (a poker term) that these messages are probably 
> designed to confuse statistical language classifiers.  (Again, they 
> don't work, won't work--and ultimately *can't* work, for reasons that
> are interesting only to people like me.)  Admittedly based on a 
> manual "training" run, the Bayesian component of my statistical 
> filter started "catching" these after seeing just two of them.

I don't know that we can simply say that they don't work. For
example, suppose the spammer knows which words are non-spammy and
inserts those into the message. That can significantly shift the
statistical scoring. Filters are, of course, combating this by
identifying such inserted text when it is invisible and ignoring it
in the calculation. But in the absence of that intelligence, or when
the words are placed plainly in the body of the message, they can
compensate for the spammy words.
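To make the effect concrete, here is a minimal sketch of how inserting content-rich "hammy" words can drag a naive Bayes score below the spam threshold. All words and probabilities are invented for illustration; the combination rule is a generic log-odds one, not any particular filter's.

```python
import math

# Hypothetical per-word spam probabilities, as a trained
# Bayesian filter might hold them.  All values are made up.
P_SPAM = {
    "viagra": 0.99, "mortgage": 0.95, "unsubscribe": 0.90,
    "meeting": 0.05, "thesis": 0.02, "algorithm": 0.03,
}

def spam_score(words, default=0.4):
    """Combine per-word spam probabilities into one score via
    summed log-odds, then map back to a probability."""
    log_odds = 0.0
    for w in words:
        p = P_SPAM.get(w, default)
        log_odds += math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-log_odds))

spammy = ["viagra", "mortgage", "unsubscribe"]
print(spam_score(spammy))                        # well above 0.9
# Appending non-spammy, content-rich words compensates:
print(spam_score(spammy + ["meeting", "thesis", "algorithm"]))
```

With these (invented) numbers, three inserted hammy words are enough to pull the combined score below 0.5, which is exactly the compensation effect described above.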

Can I suggest a subtly different approach? Rather than trying to
characterise spam, why not try to characterise your legitimate
messages and see whether incoming messages match that statistical
profile?

My reasoning is based on the fact that the profile of spam
undergoes sudden shifts as spammers switch to new tactics each
time their old ones become less effective, whereas, in my case
anyway, the profile of the legitimate mail I receive is much
more stable.

Bayesian classification systems have to undergo training in order
to learn what spam and "ham" look like. But because "spam" keeps
changing, re-training is needed over time. As time passes, the
class of spam will grow and become less clearly defined, because
the range of tactics used by spammers keeps increasing. As the
definition of "spam" becomes fuzzier, does the accuracy of
filtering decrease?

I'm particularly thinking about false positives here: given a
growing, varied, "spam" class compared to a more static "ham"
class - would it not make sense to match against a more stable
message profile?

Obviously each person's profile of received mail would be
different, but that approach has the distinct advantage of making
it harder for a spammer to know what to put into a message to
bypass the filter, because each person's profile of allowed mail
would be statistically unique to them.

This does of course require a smart enough feedback mechanism
(probably integrated with the receiver's MUA) to train the
filtering mechanism. But many of the proposals we have seen
would require changes at the MUA level, so it's no worse than
any of those.

Even if such an idea isn't viable, I would still be interested to
see how the statistical profile of spam differs from that of 
legitimate e-mail for a range of users and whether certain types
of metric are more reliable predictors of spam than others.

I have a few ideas for statistical spam characteristics but I must
admit that I lack the in-depth background in statistics to know if
any of them would work in practice. Some expert input here would
be welcome.

Thanks

Andrew

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg