RE: [spf-discuss] Making spam scores public

At 11:42 PM 3/6/2007 -0600, Seth Goodman wrote:

David MacQuigg wrote on Monday, March 05, 2007 4:27 PM -0600:

> It seems to me the real goal of our work on these reputation systems
> is to provide a universal solution to the spam problem.

While laudable, this is not possible.


Nattering Nabob of Negativism!!! :>)
http://en.wikipedia.org/wiki/Spiro_Agnew

> Keeping the data private means there is no fundamental difference
> between what you are doing and what any large ISP or spam-appliance
> company does.  How can you expect your solution to be any better
> than what these private companies are doing?

Having private data is a large advantage, which is why large ISP's don't
publish their internal listing criteria.  Attacking a small network that
uses local reputation data followed by Bayesian content filtering is
inherently harder than attacking a system using public DNSBL's and
content filters with public rule sets.  You can get some unwanted
messages through, but you can't test the messages for deliverability
ahead of time.  The only commonality among networks that use local data
is the code that generates it.  The data itself, and the system
parameters that drive the decisions, is all unknown to attackers.

I think we need to make a distinction between private reputation data and private rule sets or methods. I can see your point if we are talking about rule sets and methods, and I am inclined to agree, but not entirely convinced. The SpamAssassin folks argue that making their rule sets public is not a problem. If I understand their argument, it is that their rule sets are so large and hard to work around that it takes spammers months to adapt, and by that time, the next update on their rule set is available. I wish I had a link for you, but I do recall seeing a graph of spam vs time supporting this argument. Whatever the conclusion, it doesn't really matter, because we use SpamAssassin only to process messages that are not whitelisted, and as a means of generating reputation scores for domains that have not yet qualified to be whitelisted. SpamAssassin scores are plenty accurate for that purpose.

If spammers get very good at avoiding SpamAssassin's rule sets, we can switch to another filter, or even use different filters for different receivers, and keep the choices secret. As long as we get some feedback on the filter's decisions, we can include this feedback in an overall average rating for a domain.

As to the tactical advantage of keeping the reputation data secret, I see none. These are long-term averages of results from many receivers, not a rapid-feedback loop to help spammers improve their methods. On the contrary, I see a large advantage to publishing the data. This is what will motivate legitimate senders to block the zombies in their networks by publishing better authentication records.

When email recipients can see a direct comparison of spam ratings for comcast.net and aol.com, Comcast might just decide that publishing a strict authentication record would be in their best interest. They might lose some of their spamming customers, but so will every other large ISP that tolerates spammers. My guess is that they would welcome an opportunity to tell their spammers - "Hey guys, we have to do this. It's not our fault."

The private data advantage is reduced if your incoming mail flow is too
small.  Communications among a few peer systems can help greatly in this
case.

The main problem I see with private data is that it doesn't allow for the kind of rapid global communication needed to make spamming unprofitable. I realize that peers can be located anywhere in the world, so when I say "global", I don't mean geography. I mean no isolated "islands" of peers that can be attacked one at a time. If it takes even a few hours to downgrade a reputation, that will be plenty of time for spammers to inundate one island, then move on to the next.

If I understand the Gossip system, reputation information "diffuses" throughout the network of peers, one link at a time. How long will it take before the whole world knows that a particular domain has been taken over by spammers?

As an aside, Bayesian filters don't necessarily work better than
carefully maintained rule sets, but they do it with a fraction of the
maintenance.  Private reputation data created from your own mail flow
holds the same promise.

Bayesian filters, heuristic rulesets, IP blacklists, all are inferior to feedback from recipients. The trick is to make the amount of spam from whitelisted senders small enough that recipients don't mind having to report it. In the last three weeks, I've seen only 5 whitelisted spams in my inbox, 3 from google.com, and 2 from comcast.net. With only one or two spams a week, recipients won't mind dropping what they are doing, quickly reviewing the content of the message, and forwarding it to a spam-reporting address.

We can also make things nicer for recipients by sending them an immediate acknowledgement of their report, and a link to a website where they can see their report listed along with any others for the domain in question, the response of the domain postmaster, and any actions taken by the Rating Services watching this domain.

A few months ago, I saw a burst of spam from Yahoo's webmail servers lasting a few days. I expect they will be much quicker in shutting down these sources when they are prodded by our spam reports, and when all they have to do is publish one DNS record.

I don't mean to imply that there is no use for public reputation data.
Evaluating whether to use data from a particular source means knowing
who they are.  This exposes them to legal action, a risk most companies
do not want.  An alternative is creating composite data from all
submitters, which is the SpamCop approach that many sites find too
unreliable.  In the end, the most successful public lists are created
from networks of trusted private sources and are carefully managed.

Our ratings will come from many sources, including Gossip, if we can find a way to interface. The simplest system, which we are testing now on a small scale, simply takes an "average" of the SpamAssassin scores from many receivers over a long period of time, discarding the "outliers", which we define to be any source that attempts to move the average too much in either direction. This will eliminate the most obvious attack, sending huge volumes of phony mail to a collaborating recipient, so as to drive up the "ham" count.
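
To make the averaging idea concrete, here is a rough Python sketch. It is not our production code. The function name, the max_shift threshold, and the leave-one-out test for "moves the average too much" are all assumptions chosen for illustration; the actual outlier rule may well differ.

```python
from statistics import mean


def robust_domain_score(scores_by_receiver, max_shift=0.5):
    """Average SpamAssassin scores for one domain, ignoring outlier receivers.

    scores_by_receiver: dict mapping a receiver name to the list of
        SpamAssassin scores that receiver reported for the domain.
    max_shift: hypothetical threshold. A receiver is treated as an
        outlier if leaving out its reports changes the pooled average
        by more than this amount, in either direction.

    Returns (average over kept receivers or None, sorted dropped receivers).
    """
    all_scores = [s for scores in scores_by_receiver.values() for s in scores]
    if not all_scores:
        return None, []
    pooled = mean(all_scores)

    kept, dropped = [], []
    for receiver, scores in scores_by_receiver.items():
        # Pooled scores from every receiver except this one.
        others = [
            s
            for other, other_scores in scores_by_receiver.items()
            if other != receiver
            for s in other_scores
        ]
        if others and abs(pooled - mean(others)) > max_shift:
            dropped.append(receiver)
        else:
            kept.extend(scores)
    return (mean(kept) if kept else None), sorted(dropped)


# Illustrative data only: three ordinary receivers and one collaborating
# recipient that reports a flood of very "hammy" (negative) scores.
reports = {
    "receiver-a": [1.0, 2.0, 1.5],
    "receiver-b": [1.2, 1.8],
    "receiver-c": [2.0, 1.0],
    "shill": [-5.0] * 50,
}
score, dropped = robust_domain_score(reports)
```

With this made-up data the flood from the one collaborating recipient is the only source that shifts the pooled average by more than the threshold, so it is the only one discarded, and the remaining receivers determine the score.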

While I hope that spammers will simply give up, and not force us to the next step, I am fully prepared for a battle of wits, as clever spammers try to fool equally clever managers at the Rating Services using our Registry. I'm working now on some Python scripts to display the data on a domain in a way that will allow managers to quickly spot anomalies. It should be very difficult for a spammer to generate a broad distribution of "ham" over a long period of time, enough to look like a normal legitimate sender.

I still don't understand the legal threat you keep referring to. There is no such thing as a "bad" reputation in our system. Ratings range from C (unknown) to A (less than one spam in 100 messages). We don't bother with lower ratings, because we assume that no spammer will continue to use a name with a rating lower than a fresh new "unknown" name. If a spammer is thwarted in an attempt to gain a higher reputation, who is he going to sue? What would be the allegation - "Spamhaus, you failed to give me the A-rating I deserve after 3 months of diligently faking a legitimate mail-flow?"
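
As a sketch of how such a scale might be computed: only the two endpoints, C (unknown) and A (less than one spam in 100 messages), come from the description above. The intermediate B cutoff and the minimum message count needed to leave "unknown" are hypothetical values picked purely for illustration.

```python
def domain_rating(spam, total, min_messages=1000, b_threshold=0.05):
    """Return a letter rating for a domain; there is nothing below C.

    spam, total: counts of spam and of all messages seen from the domain.
    min_messages: hypothetical minimum history before a domain can
        qualify for anything better than C (unknown).
    b_threshold: hypothetical spam-rate cutoff for the B rating.
    """
    if total < min_messages:
        return "C"  # too little history: treated as unknown
    rate = spam / total
    if rate < 0.01:
        return "A"  # less than one spam in 100 messages
    if rate < b_threshold:
        return "B"  # assumed intermediate level
    return "C"  # no rating below "unknown" is ever issued
```

Note that a heavily spamming domain simply falls back to C, the same rating a fresh new name would get, so there is never a "bad" rating to dispute.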

I don't see any threat from legitimate senders who lose their reputation through innocent mistakes. A well-managed rating service will work with a legitimate mailer to correct the mistake quickly. Let's say yahoo.com is doing quite well with their current default record, authorizing 84992 IP addresses. Suddenly spammers discover that they can forge Yahoo's name, at least on the zombies that lie within one of these huge IP blocks. What will Yahoo do, hire lawyers to sue rating services all over the world, or simply assert control of their Registry record, and de-authorize the zombies?

> I think the way to deal with threats of costly lawsuits is to set up
> the company in a jurisdiction with more common sense in their legal
> system than the USA.

This is the precise reason that U.S. companies will not likely make
their reputation data public.

The exception would be large companies like Ironport. No spammer would dare sue them. They do, in fact, make their data public, just not in a way that can be automated without paying a fee. I expect that fees to rating services like Ironport will be the biggest cost in providing Registry services. That is as it should be. We need the best services in the world to provide the most reliable domain ratings. Everything else can be automated.

I believe the reason we don't have public reputation data is not fear of lawsuits, but rather a desire by companies to maintain a competitive advantage in selling their bundled products.

Even if there were no threat of lawsuits,
publishing this data tells your attackers how effective they were with
each spam run.

The data that is published consists of long-term averages of data from many sources. This will be of very little value to the spammer. The only immediate feedback a spammer might see is an alert that goes out when a reputable domain is suddenly hijacked.

> If some rating service is put out of business by a lawsuit, others
> will take its place.

Even the threat of lawsuits is enough to deter most people.

The few Rating Services that are brave enough not to fear harassment in U.S. courts will include the ones listed in our Registry records.

A much bigger worry regarding the reliability of Rating Services will be the possibility of bogus services controlled by spammers. Our strategy here is to pick the best services by allowing Registry subscribers to designate what fraction of their subscription fee goes to each Service. Corrupt or incompetent services will quickly lose their income, and eventually be dropped from the Registry.
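
The fee-allocation idea can be sketched in a few lines of Python. The data layout, the function name, and the service names below are made up for illustration; how the Registry would actually record a subscriber's choices is not settled here.

```python
from collections import defaultdict


def service_income(subscribers):
    """Total the income each Rating Service receives from subscribers.

    subscribers: list of (fee, allocation) pairs, where allocation maps
        a service name to the fraction of that subscriber's fee it gets.
        Fractions may sum to less than 1 (the rest is unallocated),
        but must not exceed 1 or be negative.
    """
    income = defaultdict(float)
    for fee, allocation in subscribers:
        if any(fraction < 0 for fraction in allocation.values()):
            raise ValueError("allocation fractions must be non-negative")
        if sum(allocation.values()) > 1.0 + 1e-9:
            raise ValueError("allocation fractions exceed 100% of the fee")
        for service, fraction in allocation.items():
            income[service] += fee * fraction
    return dict(income)


# Hypothetical subscribers and services.
totals = service_income([
    (100.0, {"service-x": 0.5, "service-y": 0.5}),
    (50.0, {"service-x": 1.0}),
])
```

A service that subscribers stop trusting sees its total shrink toward zero as allocations move elsewhere, which is the signal for dropping it from the Registry.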


-- Dave


-------
Sender Policy Framework: http://www.openspf.org/
Archives at http://archives.listbox.com/spf-discuss/current/
