ietf-asrg

Re: [Asrg] Meaningless words in spams

2004-02-16 12:34:07
At 02:19 PM 2/16/2004 -0500, you wrote:
At 6:01 PM +0000 2/16/04, Matt Schneider wrote:
At 12:28 PM 2/16/2004 -0500, you wrote:
So, I guess a sliding window would catch filter-busting headers and trailers.

No, they add a bunch of garbage right in the body of the spams too, fake HTML tags or text that's the same color as the background.

There's no real way to avoid this stuff.

Those are both really quite easy to catch, and can even be caught by automatic learning filters. For example, the word 'oblivity' inside angle brackets (i.e. a bogus HTML tag) occurs nowhere at all in any of my legitimate mail of the past year. It occurs 6 times in my spam of 2004. A filter that checks for strict HTML compliance in HTML mail would have caught all of those, and I see in my current set of Bayesian classifiers that this 'word' (complete with <>) is part of why the later spams containing it were marked as probable spam. Similarly, text that is the same color as the background is a programmatically detectable trick, and there are already filters in use that detect it as spamsign.
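As a rough illustration of the two checks described above, the following Python sketch flags tags that are not in a list of known HTML tag names and flags font colors that match the page background. It is only a sketch under stated assumptions: the class name `SpamsignScanner`, the function `scan`, the abbreviated `KNOWN_TAGS` set, and the default white background are all hypothetical and are not taken from any filter mentioned in this thread.

```python
from html.parser import HTMLParser

# Assumption: an abbreviated set of valid tag names, for illustration only.
# A real strict-compliance check would use the complete HTML tag list.
KNOWN_TAGS = {
    "html", "head", "title", "body", "p", "br", "a", "b", "i", "u",
    "font", "div", "span", "table", "tr", "td", "img", "center", "hr",
}

# Assumption: treat the background as white when <body> sets no bgcolor.
DEFAULT_BACKGROUND = "#ffffff"


class SpamsignScanner(HTMLParser):
    """Collects bogus tag names and font colors equal to the background."""

    def __init__(self):
        super().__init__()
        self.bogus_tags = []
        self.background = None
        self.hidden_text_colors = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        attrs = dict(attrs)
        if tag not in KNOWN_TAGS:
            self.bogus_tags.append(tag)
        if tag == "body" and attrs.get("bgcolor"):
            self.background = attrs["bgcolor"].lower()
        if tag == "font" and attrs.get("color"):
            color = attrs["color"].lower()
            if color == (self.background or DEFAULT_BACKGROUND):
                self.hidden_text_colors.append(color)


def scan(html_text):
    """Return (bogus_tags, hidden_text_colors) for one HTML message body."""
    scanner = SpamsignScanner()
    scanner.feed(html_text)
    scanner.close()
    return scanner.bogus_tags, scanner.hidden_text_colors


sample = ('<html><body bgcolor="#FFFFFF"><oblivity>Buy now'
          '<font color="#ffffff">random words</font></body></html>')
bogus, hidden = scan(sample)
print(bogus, hidden)
```

This compares color strings literally, so it would miss near-matches such as "#fffffe" on white, named colors, or colors set through CSS; a practical detector would need to normalize and compare colors by distance.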

Yes, you can look for bad HTML tags and colored text, but there will always be a way to sneak garbage in no matter what you do.

I also note in peeking at my current Bayesian classifiers that there are many perfectly valid but uncommonly used words there which seem to be strong spamsign for no obvious reason. At least no obvious reason until I look at where in my recent spam they have appeared: the filterbusting attempts that use random dictionary words. A quick browse of the 200k entries in my filter collection and the accuracy it shows leads me to believe that the spammers who try to break filtering are still losing the arms race and may not ever hit on a winning tactic.
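To make the effect described above concrete, here is a minimal Python sketch of how a per-token spam score can be derived from occurrence counts, in the general style of Bayesian mail filters. It is not the classifier used by anyone in this thread: the function name `token_spam_probability`, the corpus sizes, and the clamping bounds of 0.01 and 0.99 are assumptions chosen for illustration.

```python
def token_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Estimate how strongly one token indicates spam.

    spam_count / ham_count: number of spam / legitimate messages containing
    the token. n_spam / n_ham: corpus sizes (hypothetical values here).
    """
    spam_freq = spam_count / n_spam
    ham_freq = ham_count / n_ham
    if spam_freq + ham_freq == 0:
        # Token never seen before: no evidence either way.
        return 0.5
    p = spam_freq / (spam_freq + ham_freq)
    # Clamp so that a single token can never be treated as absolute proof.
    return min(0.99, max(0.01, p))


# A token seen in 6 spams and no legitimate mail, with assumed corpora of
# 1000 messages each, lands at the upper clamp.
print(token_spam_probability(6, 0, 1000, 1000))
```

Under this kind of scoring, a rare dictionary word that a spammer inserts as filler, and that never appears in the recipient's legitimate mail, ends up as strong spamsign, which is consistent with the observation above.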

Yes, but you can't start filtering mail based on uncommon words either. One time I was corresponding with a reporter about a spam article and one of my messages bounced back. I guess I used the word "spam" too many times (like all spammers do, of course). The point is, content filtering sucks.

- Matt


_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg


