spf-discuss

RE: [spf-discuss] Making spam scores public

2007-03-08 04:22:27
At 11:42 PM 3/6/2007 -0600, Seth Goodman wrote:
David MacQuigg wrote on Monday, March 05, 2007 4:27 PM -0600:

> It seems to me the real goal of our work on these reputation systems
> is to provide a universal solution to the spam problem.

While laudable, this is not possible.

Nattering Nabob of Negativism!!! :>)
http://en.wikipedia.org/wiki/Spiro_Agnew

> Keeping the data private means there is no fundamental difference
> between what you are doing and what any large ISP or spam-appliance
> company does.  How can you expect your solution to be any better
> than what these private companies are doing?

Having private data is a large advantage, which is why large ISPs don't
publish their internal listing criteria.  Attacking a small network that
uses local reputation data followed by Bayesian content filtering is
inherently harder than attacking a system using public DNSBLs and
content filters with public rule sets.  You can get some unwanted
messages through, but you can't test the messages for deliverability
ahead of time.  The only commonality among networks that use local data
is the code that generates it.  The data itself, and the system
parameters that drive the decisions, are all unknown to attackers.

I think we need to make a distinction between private reputation data and private rule sets or methods. I can see your point if we are talking about rule sets and methods, and I am inclined to agree, but I am not entirely convinced. The SpamAssassin folks argue that making their rule sets public is not a problem. If I understand their argument, it is that their rule sets are so large and hard to work around that it takes spammers months to adapt, and by that time, the next update to their rule set is available. I wish I had a link for you, but I do recall seeing a graph of spam vs. time supporting this argument. Whatever the conclusion, it doesn't really matter, because we use SpamAssassin only to process messages that are not whitelisted, and as a means of generating reputation scores for domains that have not yet qualified to be whitelisted. SpamAssassin scores are plenty accurate for that purpose.

If spammers get very good at avoiding SpamAssassin's rule sets, we can switch to another filter, or even use different filters for different receivers, and keep the choices secret. As long as we get some feedback on the filter's decisions, we can include this feedback in an overall average rating for a domain.

As to the tactical advantage of keeping the reputation data secret, I see none. These are long-term averages of results from many receivers, not a rapid-feedback loop to help spammers improve their methods. On the contrary, I see a large advantage to publishing the data. This is what will motivate legitimate senders to block the zombies in their networks by publishing better authentication records.

When email recipients can see a direct comparison of spam ratings for comcast.net and aol.com, Comcast might just decide that publishing a strict authentication record would be in their best interest. They might lose some of their spamming customers, but so will every other large ISP that tolerates spammers. My guess is that they would welcome an opportunity to tell their spammers - "Hey guys, we have to do this. It's not our fault."

The private data advantage is reduced if your incoming mail flow is too
small.  Communications among a few peer systems can help greatly in this
case.

The main problem I see with private data is that it doesn't allow for the kind of rapid global communication needed to make spamming unprofitable. I realize that peers can be located anywhere in the world, so when I say "global", I don't mean geography. I mean no isolated "islands" of peers that can be attacked one at a time. If it takes even a few hours to downgrade a reputation, that will be plenty of time for spammers to inundate one island, then move on to the next.

If I understand the Gossip system, reputation information "diffuses" throughout the network of peers, one link at a time. How long will it take before the whole world knows that a particular domain has been taken over by spammers?
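To make the question concrete, here is a rough sketch of how one might estimate diffusion time. It assumes (and this is only my assumption, not a description of how Gossip actually works) that each peer passes a reputation change to its direct neighbors once per round, so the delay to any peer is its hop distance from the origin. The peer names and the `peers` map are made up for illustration.

```python
from collections import deque

def rounds_to_reach(peers, origin):
    """Return {peer: rounds} for every peer reachable from origin.

    peers maps each peer name to the peers it exchanges data with.
    Assumption: one hop per round, i.e. a breadth-first spread.
    """
    rounds = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        for neighbor in peers.get(node, ()):
            if neighbor not in rounds:
                rounds[neighbor] = rounds[node] + 1
                queue.append(neighbor)
    return rounds

# Hypothetical network: a chain of four peers plus one isolated "island".
peers = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
    "island": [],
}
reached = rounds_to_reach(peers, "a")
slowest = max(reached.values())           # rounds until the last reachable peer hears
never_reached = set(peers) - set(reached)  # peers that never get the update
```

Multiply the slowest hop count by the interval between exchanges to get a wall-clock estimate. Any peer left in `never_reached` is exactly the kind of island I am worried about.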

As an aside, Bayesian filters don't necessarily work better than
carefully maintained rule sets, but they do it with a fraction of the
maintenance.  Private reputation data created from your own mail flow
holds the same promise.

Bayesian filters, heuristic rule sets, and IP blacklists are all inferior to feedback from recipients. The trick is to make the amount of spam from whitelisted senders small enough that recipients don't mind having to report it. In the last three weeks, I've seen only 5 whitelisted spams in my inbox, 3 from google.com, and 2 from comcast.net. With only one or two spams a week, recipients won't mind dropping what they are doing, quickly reviewing the content of the message, and forwarding it to a spam-reporting address.

We can also make things nicer for recipients by sending them an immediate acknowledgement of their report, and a link to a website where they can see their report listed along with any others for the domain in question, the response of the domain postmaster, and any actions taken by the Rating Services watching this domain.

A few months ago, I saw a burst of spam from Yahoo's webmail servers lasting a few days. I expect they will be much quicker in shutting down these sources when they are prodded by our spam reports, and when all they have to do is publish one DNS record.

I don't mean to imply that there is no use for public reputation data.
Evaluating whether to use data from a particular source means knowing
who they are.  This exposes them to legal action, a risk most companies
do not want.  An alternative is creating composite data from all
submitters, which is the SpamCop approach that many sites find too
unreliable.  In the end, the most successful public lists are created
from networks of trusted private sources and are carefully managed.

Our ratings will come from many sources, including Gossip, if we can find a way to interface. The simplest system, which we are testing now on a small scale, simply takes an "average" of the SpamAssassin scores from many receivers over a long period of time, discarding the "outliers", which we define to be any source that attempts to move the average too much in either direction. This will eliminate the most obvious attack, sending huge volumes of phony mail to a collaborating recipient, so as to drive up the "ham" count.
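A minimal sketch of that averaging idea, for illustration only. It is not the code we are testing. It assumes each receiver is first reduced to its own mean SpamAssassin score, so that volume from a single receiver cannot dominate, and it treats "moves the average too much" as "lies more than `max_deviation` points from the median of the receiver means". The function name, the threshold, and the sample data are all hypothetical.

```python
from statistics import mean, median

def domain_average(scores_by_receiver, max_deviation=3.0):
    """Return (average, outliers) for one sending domain.

    scores_by_receiver maps a receiver name to the list of SpamAssassin
    scores it reported. Each receiver counts once, however much mail it saw.
    """
    receiver_means = {r: mean(s) for r, s in scores_by_receiver.items() if s}
    if not receiver_means:
        return None, []
    center = median(list(receiver_means.values()))
    # Keep only receivers whose own mean is close to the consensus.
    kept = {r: m for r, m in receiver_means.items()
            if abs(m - center) <= max_deviation}
    outliers = sorted(r for r in receiver_means if r not in kept)
    if not kept:
        return None, outliers
    return mean(list(kept.values())), outliers

# Hypothetical data: receiver "d" is flooded with phony "ham" (very low scores).
reports = {
    "a": [1.0, 2.0],
    "b": [2.0, 2.0],
    "c": [1.0, 3.0],
    "d": [-40.0] * 500,
}
average, outliers = domain_average(reports)
```

Here the 500 phony messages at "d" count as one source, and that source is dropped because it sits far from the other three.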

While I hope that spammers will simply give up, and not force us to the next step, I am fully prepared for a battle of wits, as clever spammers try to fool equally clever managers at the Rating Services using our Registry. I'm working now on some Python scripts to display the data on a domain in a way that will allow managers to quickly spot anomalies. It should be very difficult for a spammer to generate a broad distribution of "ham" over a long period of time, enough to look like a normal legitimate sender.

I still don't understand the legal threat you keep referring to. There is no such thing as a "bad" reputation in our system. Ratings range from C (unknown) to A (less than one spam in 100 messages). We don't bother with lower ratings, because we assume that no spammer will continue to use a name with a rating lower than a fresh new "unknown" name. If a spammer is thwarted in an attempt to gain a higher reputation, who is he going to sue? What would be the allegation - "Spamhaus, you failed to give me the A-rating I deserve after 3 months of diligently faking a legitimate mail-flow?"
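The rating scale can be sketched in a few lines. Only two points come from the description above: C means unknown, and A means less than one spam in 100 messages. The minimum message count before a domain leaves C, and the cutoff for an intermediate B rating, are placeholders I made up for the example.

```python
def rating(spam_count, total_count, min_messages=1000, b_max_spam_rate=0.05):
    """Map a domain's long-term counts to a letter rating.

    min_messages and b_max_spam_rate are hypothetical parameters.
    There is nothing below C: a domain that does badly is simply
    treated the same as an unknown one.
    """
    if total_count < min_messages:
        return "C"                       # not enough history: unknown
    if spam_count * 100 < total_count:
        return "A"                       # less than one spam in 100 messages
    if spam_count <= b_max_spam_rate * total_count:
        return "B"                       # assumed intermediate grade
    return "C"

grades = [rating(5, 1000), rating(10, 1000), rating(200, 1000), rating(0, 10)]
```

With these placeholder thresholds the four calls give A, B, C and C.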

I don't see any threat from legitimate senders who lose their reputation through innocent mistakes. A well-managed rating service will work with a legitimate mailer to correct the mistake quickly. Let's say yahoo.com is doing quite well with their current default record, authorizing 84992 IP addresses. Suddenly spammers discover that they can forge Yahoo's name, at least on the zombies that lie within one of these huge IP blocks. What will Yahoo do, hire lawyers to sue rating services all over the world, or simply assert control of their Registry record, and de-authorize the zombies?
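For anyone who wants to check how many addresses a record authorizes, here is a short sketch using Python's standard `ipaddress` module. The CIDR blocks are documentation-range placeholders, not Yahoo's actual record, and the function name is my own.

```python
import ipaddress

def authorized_address_count(cidr_blocks):
    """Count distinct IP addresses covered by a list of CIDR blocks.

    Overlapping or nested blocks are merged first so that no address
    is counted twice. All blocks must be the same IP version.
    """
    networks = [ipaddress.ip_network(block) for block in cidr_blocks]
    merged = ipaddress.collapse_addresses(networks)
    return sum(net.num_addresses for net in merged)

# Placeholder blocks; the second is nested inside the first.
blocks = ["192.0.2.0/24", "192.0.2.128/25", "198.51.100.0/30"]
count = authorized_address_count(blocks)
```

Tightening the record then amounts to replacing wide blocks with the narrower ones that actually contain the outgoing mail servers, and re-running the count.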

> I think the way to deal with threats of costly lawsuits is to set up
> the company in a jurisdiction with more common sense in their legal
> system than the USA.

This is the precise reason that U.S. companies will not likely make
their reputation data public.

The exception would be large companies like Ironport. No spammer would dare sue them. They do, in fact, make their data public, just not in a way that can be automated without paying a fee. I expect that fees to rating services like Ironport will be the biggest cost in providing Registry services. That is as it should be. We need the best services in the world to provide the most reliable domain ratings. Everything else can be automated.

I believe the reason we don't have public reputation data is not fear of lawsuits, but rather a desire by companies to maintain a competitive advantage in selling their bundled products.

Even if there were no threat of lawsuits,
publishing this data tells your attackers how effective they were with
each spam run.

The data that is published is long-term averages of data from many sources. This will be of very little value to the spammer. The only immediate feedback a spammer might see is an alert that goes out when a reputable domain is suddenly hijacked.

> If some rating service is put out of business by a lawsuit, others
> will take its place.

Even the threat of lawsuits is enough to deter most people.

The few Rating Services that are brave enough not to fear harassment in U.S. courts will include the ones listed in our Registry records.

A much bigger worry regarding the reliability of Rating Services will be the possibility of bogus services controlled by spammers. Our strategy here is to pick the best services by allowing Registry subscribers to designate what fraction of their subscription fee goes to each Service. Corrupt or incompetent services will quickly lose their income, and eventually be dropped from the Registry.
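The fee-allocation idea is simple enough to sketch. This is an illustration under my own assumptions: each subscriber supplies a fee and a map of fractions per Service, the fractions may sum to less than 1 (the remainder stays with the Registry), and the service names and amounts are invented.

```python
from collections import defaultdict

def service_income(subscriptions):
    """Total the income each Rating Service receives.

    subscriptions is a list of (fee, {service_name: fraction}) pairs.
    Raises ValueError if a subscriber's fractions are negative or
    add up to more than the whole fee.
    """
    income = defaultdict(float)
    for fee, fractions in subscriptions:
        if any(f < 0 for f in fractions.values()):
            raise ValueError("fractions must not be negative")
        if sum(fractions.values()) > 1.0 + 1e-9:
            raise ValueError("fractions must not exceed the whole fee")
        for service, fraction in fractions.items():
            income[service] += fee * fraction
    return dict(income)

# Hypothetical subscribers and services.
income = service_income([
    (100.0, {"service1": 0.75, "service2": 0.25}),
    (50.0, {"service1": 1.0}),
])
```

A Service that subscribers stop choosing sees its share fall toward zero, which is the signal for dropping it from the Registry.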

-- Dave


-------
Sender Policy Framework: http://www.openspf.org/
Archives at http://archives.listbox.com/spf-discuss/current/
To unsubscribe, change your address, or temporarily deactivate your subscription, please go to http://v2.listbox.com/member/?list_id=735
