Re: RFC validation samples?


On Aug 06 2004, Keith Moore wrote:

>> There's a funny connection between these types of protocol elements,
>> spam and statistical learning.
>
> only in the short term.  eventually spammers learn to make their message
> headers look "just like" legitimate messages.
>
> these days, spammers seem to learn more quickly, and upgrade their
> software more often, than MUA and MTA writers.


This isn't necessarily a disadvantage. If spammers' message headers
evolve towards the correct format, I expect statistical filtering
decisions will naturally gravitate towards the differences in the mail
transport path instead.

Moreover, we shouldn't forget why spammer software sometimes uses a
token such as "jAn" instead of the common form "Jan". To the transport
mechanism (i.e. according to the RFCs), "Jan" and "jAn" are
identical, so there is no harm in using one or the other. But widely
deployed filters use a hash of the message header, or parts of the
message body, etc.  When hashing a header containing "Jan", you get a
different answer than if the header contains "jAn", allowing the spam
to slip by. Now obviously in this example it would be easy to
normalize the case of tokens like "jAn" before hashing, but this explains
why spammer software likes to add nonstandard elements to mail messages.
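
To make the hashing point concrete, here is a minimal sketch in Python.
The function names, the choice of SHA-256 and the sample Date header are
my own assumptions for illustration; real hash-based filters use their
own digests and their own choice of header fields.

```python
import hashlib
import re

# Month abbreviations as they appear in a Date header.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")
MONTH_RE = re.compile(r"\b(" + "|".join(MONTHS) + r")\b", re.IGNORECASE)


def header_digest(header: str) -> str:
    """Hash the header text exactly as given (hypothetical filter step)."""
    return hashlib.sha256(header.encode("utf-8")).hexdigest()


def normalize_months(header: str) -> str:
    """Rewrite any case variant of a month abbreviation to the usual form."""
    return MONTH_RE.sub(lambda m: m.group(1).capitalize(), header)


plain = "Date: Fri, 2 Jan 2004 10:00:00 +0000"
tricky = "Date: Fri, 2 jAn 2004 10:00:00 +0000"

# Without normalization the two digests differ, so a digest blacklist
# built from the first header does not match the second.
differs_raw = header_digest(plain) != header_digest(tricky)

# After normalization both headers hash to the same value.
same_after = (header_digest(normalize_months(plain))
              == header_digest(normalize_months(tricky)))
```

Of course this only closes one hole; each new nonstandard element needs
its own normalization rule, which is the weakness of the approach.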

If spammer messages look more "legitimate", they become vulnerable to
more defenses. And of course, for a statistical filter which updates/learns
continuously from spam examples, whether messages start or stop using tricks
such as "jAn" is just another fact for adjusting the weights; there's no need
to change any algorithms at all.
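
As a rough illustration of that last point, here is a toy token-count
filter in Python. It is only a sketch under my own assumptions (the
class name, the add-one smoothing and the two-token messages are all
made up for the example), not a description of any deployed filter.

```python
import math
from collections import Counter


class TokenFilter:
    """Toy filter: per-class token counts with add-one smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def learn(self, tokens, label):
        """Update the counts from one labelled message."""
        for tok in tokens:
            self.counts[label][tok] += 1
            self.totals[label] += 1

    def log_odds(self, token):
        """Positive values point towards spam, negative towards ham."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) + 1
        p_spam = (self.counts["spam"][token] + 1) / (self.totals["spam"] + vocab)
        p_ham = (self.counts["ham"][token] + 1) / (self.totals["ham"] + vocab)
        return math.log(p_spam / p_ham)


f = TokenFilter()
f.learn(["jAn", "06"], "spam")   # spammer uses the odd form
f.learn(["Jan", "06"], "ham")    # legitimate mail uses the usual form
before = f.log_odds("Jan")

# Spammers switch to the usual form: the same learn() call absorbs it.
for _ in range(3):
    f.learn(["Jan", "06"], "spam")
after = f.log_odds("Jan")
```

The weight for "Jan" moves towards spam as the new examples arrive, and
"jAn" keeps whatever evidence it had; nothing in the code was changed.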

-- 
Laird Breyer.