ietf-asrg

Re: [Asrg] Spam detection system proposal

2003-03-05 13:00:58
From: "David F. Skoll" <dfs(_at_)roaringpenguin(_dot_)com>

What are the *detectable* differences between a spammer and a legitimate
mass mailer, assuming we can't read the minds of the recipients?

There are no such differences, detectable or not.

Then this ASRG is a waste of time.

I think this ASRG is less about distinguishing spammers and legitimate
mass mailers than spam from legitimate mail.


...
I disagree, because many spammers work hard to remove bad addresses
from their target lists.

Really?  Don't you think it's worth a shot to try to gather hard data?
If you're right, then my idea is no good.  If you're wrong, then it is.
Unfortunately, without setting up a system to gather this data, we'll
never know.

If you look at a few thousand spam messages, you'll find hard data proving
that many spammers do care about removing bad addresses.  Notice the whines
in spam about how wrong it is to complain to ISPs hosting remove
addresses.  Notice how many envelope Mail_From values are valid
addresses owned by the spammer at free ISPs.  Notice the spammers
using bulk mail packages (not spamware) that do automatic removal of
bad addresses using synthetic Mail_From values encoding the target
address. If spammers didn't want bounces or "removes", they would not
continually sign up for and use new free-provider drop-boxes.
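To illustrate what "synthetic Mail_From values encoding the target
address" can look like, here is a minimal sketch of one common convention
for doing so.  The function names, the "bounce" prefix, the "+" and "="
separators and the bulk.example domain are assumptions made for the
example; they are not taken from any particular bulk mail package.

```python
def encode_return_path(recipient, prefix="bounce", domain="bulk.example"):
    """Build an envelope sender that embeds the target address.

    Hypothetical layout: prefix+local=rcptdomain@domain
    """
    local, _, rcpt_domain = recipient.rpartition("@")
    return f"{prefix}+{local}={rcpt_domain}@{domain}"


def decode_return_path(return_path):
    """Recover the target address from a bounce sent to the synthetic sender."""
    local, _, _ = return_path.rpartition("@")
    # Everything after the first "+" is the encoded target address.
    _, _, encoded = local.partition("+")
    # The last "=" separates the target's local part from its domain,
    # because a domain name cannot contain "=".
    user, _, rcpt_domain = encoded.rpartition("=")
    return f"{user}@{rcpt_domain}"


sender = encode_return_path("alice@example.com")
print(sender)
print(decode_return_path(sender))
```

Because each message carries a different envelope sender, a bounce
identifies the bad address by itself, without any parsing of the bounce
body.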

Consider also reports that big ISPs automatically blacklist IP addresses
after they've sent to "too many" bad addresses.  This provides a strong
incentive for clean target lists.

Of course, there are plenty of other spammers who do not care how
dirty their target lists are.


Generalizations such as
"all spammers have lots of bad addresses in their lists" are as wrong
as "all spammers use open relays" or "spam involves forged headers."

I never said that.  I said I believe that many spammers have lots of bad
addresses, simply based on how they obtain addresses in the first place.
Maybe you're right; I don't know.  But we should at least try to find out.

"Finding out" seems to be based on the false notion that all spammers
do one or the other.  There are only about 1500 current serious spammers,
but their tactics and goals are quite varied.


I think the only way to detect spam runs is to examine passing mail bodies
and look for those that are substantially identical and therefore bulk.
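The simplest form of this idea can be sketched as a counter keyed by a
digest of the message body.  This is only an illustration: the class
name and the threshold of 3 are assumptions, and an exact digest catches
only byte-identical bodies, not "substantially identical" ones.

```python
import hashlib
from collections import Counter


class BulkCounter:
    """Count how often the same body has been seen (hypothetical sketch)."""

    def __init__(self, bulk_threshold=3):
        # Assumed number of sightings after which a body counts as bulk.
        self.bulk_threshold = bulk_threshold
        self.seen = Counter()

    @staticmethod
    def _digest(body):
        return hashlib.sha256(body.encode("utf-8")).hexdigest()

    def report(self, body):
        """Record one sighting of a body and return its running count."""
        digest = self._digest(body)
        self.seen[digest] += 1
        return self.seen[digest]

    def is_bulk(self, body):
        return self.seen[self._digest(body)] >= self.bulk_threshold


counter = BulkCounter(bulk_threshold=3)
for _ in range(3):
    counter.report("Make money fast from home")
counter.report("Lunch on Thursday?")
print(counter.is_bulk("Make money fast from home"))
print(counter.is_bulk("Lunch on Thursday?"))
```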

Bulk != Spam.  

Of course "spam" != "bulk", but "bulk" is a lot closer to "spam" than
"mentions viagra" or "was sent to a (temporarily) bad address."


Any system to detect "similar but not identical"
messages can be thwarted if it uses a checksum scheme, and is too slow
to be practical if it uses more sophisticated message-closeness
measures.

You are quite wrong about that.  Simplistic checksum schemes are easily
thwarted, but the DCC and other checksum schemes including Cloudmark are
not simplistic checksums.  I've been computing what I call fuzzy checksums
since about 1997, and not yet had serious problems detecting bulk mail.
(In 1997 it was at a Fortune 500's corporate gateway.  Today it's the DCC.)
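To make the distinction concrete, here is a toy illustration of why a
checksum over normalized text is harder to thwart than a checksum over
the raw bytes.  It is not the DCC or Cloudmark algorithm, neither of
which is described in this message; the normalization rules (lowercase,
drop digits and punctuation, collapse whitespace) are assumptions chosen
for the example.

```python
import hashlib
import re


def simple_checksum(body):
    """Digest of the raw body; any changed byte changes the result."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def fuzzy_checksum(body):
    """Digest of a normalized body (toy normalization, not the DCC's)."""
    text = body.lower()
    # Drop digits and punctuation, which are cheap to randomize per message.
    text = re.sub(r"[^a-z\s]", "", text)
    # Collapse runs of whitespace so spacing tricks do not matter.
    words = text.split()
    return hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()


first = "Buy NOW!!! Limited offer, ref 48213."
second = "buy now!  Limited   offer, ref 99107"

print(simple_checksum(first) == simple_checksum(second))
print(fuzzy_checksum(first) == fuzzy_checksum(second))
```

The two bodies differ in case, spacing, punctuation and a per-message
reference number, so their raw digests differ while the normalized
digests match.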

You are also quite wrong about the speed of checksumming with modern
cheap CPUs.  There are ISPs and others that are pushing a lot of
messages per day through DCC clients.  The significant costs are in
disk and network bandwidth, not CPU cycles.


Vernon Schryver    vjs(_at_)rhyolite(_dot_)com
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg