
Re: [Asrg] Hotmail used for scripted spam (from SlashDot)

2003-06-11 12:37:04
Yakov Shafranovich wrote:
......
Note, that a CRI protocol will not help here since the HotMail C/R system will authenticate the sender. Only active involvement by the ISP and limiting amount of outgoing email will help. RMX will not resolve this issue either.
.....

Hmm, very quiet on this one. Are we then back to square one? What will this mean for other web mail vendors? There are a number of Perl/PHP scripts out there that batch-connect to retrieve mail from Hotmail, Lycos and others, and they are modifiable to send instead. I already thought that C/R systems were a bit of a colossus on clay feet, but most of the discussion during May/June seems to have been about which C/R method is best, not so much about whether it is practical at all, or about methods simpler than C/R and RMX, which force us to make more drastic changes to our infrastructure.

About two months ago, I presented a complementary methodology for spam detection that I call the "Earnest" method, because it is based on the only earnest data that exists in spam: URLs and phone numbers.

At that time Kee Hinckley had some serious objections to my ideas, so I spent some time making a more thorough survey, adding to my original 6,000 spams of my own some 40,000 from the Spamhaus archives and another 15,000 from a guy active in the European RIPE spam discussion list, i.e. a total of about 60,000 spams. Hopefully I now have a more statistically significant amount of spam with which to revisit my idea.

For background: from my own spams I extracted 1,500 unique URLs and phone numbers, from Spamhaus another 5,000 and from RIPE the last 2,500, on the base pattern something.com, something.biz, something.co.tw and so on. After sorting my own and the Spamhaus data together, 1,200 were common to both, giving a total of 5,300 unique entries. After adding the RIPE data, another 800 were common, giving a total of ca 7,000 unique URLs and numbers. When visually reviewing them, I also found that many were so alike in naming structure that they are obviously related, and that the data probably represents no more than 3,000-4,000 real web sites.
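
For anyone who wants to try the extraction themselves, it is roughly this kind of one-liner (a rough sketch only, assuming GNU grep; the TLD list and the file names are just examples, not my actual script):

    # pull domain-looking strings out of a spam corpus and deduplicate them
    # (spam-corpus/ and domains.txt are placeholder names)
    grep -hEio '[a-z0-9][a-z0-9.-]+\.(com|net|org|biz|info|co\.[a-z][a-z])' spam-corpus/* \
        | tr 'A-Z' 'a-z' \
        | sort | uniq > domains.txt
    wc -l domains.txt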

My conclusion is that even if we deal with hundreds of millions of spams, we, going by Spamhaus' estimate of the number of really active spammers (200+), only deal with a limited number of domains and phone numbers, maybe fewer than 100,000. So if we concentrate the effort, as I laid out in the Earnest method presentation, on extracting that _real_ data, the filtering task becomes easier and faster and requires fewer resources.

As I said last time, the key point is that if you send a spam, you, as a spammer, need to make it easy for the user to go to the site so you get your dues, i.e. supply a working URL or a readable phone number. If you encode that URL in any way other than encoding the whole page with base64 (a quite possible spammer solution), e.g. as "%xx" or "&#xxx;" web character codes, you force your target to type in the whole URL, which leads to a drastically reduced hit rate for the spammer's customer. Simplicity is the main driver behind spam. If the process doesn't produce, the customers will not use spam; a simple economic maxim IT people often tend to forget.

And as far as I have found in my tests, you cannot char-encode either the "a href" or the "http://" part of a URL, because the link will then not work. Therefore these parts will always be in clear text and greppable. If those lines are decoded for any char encoding of the addresses and compared in a case-insensitive, free-text search, they can't hide: if their URL is there, you get a hit. And if they try to put in more URLs, they still don't get away. They could try to disguise the address with a line break inside it, but if the filter checks for breaks before the closing ">", that can be handled. This is simpler than analyzing the whole letter.
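
To give an idea of what I mean, the decoding step before the matching could look roughly like this (a simplified sketch of a sed-based filter, not my real one; it only handles a few of the most common codes, and mail.txt / urllist.txt are placeholder names):

    # undo the most common char encodings of "." "/" ":" before matching
    # against the URL list; a real filter needs the full %xx and &#xxx; tables
    sed -e 's/%2[Ee]/./g'  -e 's/%2[Ff]/\//g' -e 's/%3[Aa]/:/g' \
        -e 's/&#46;/./g'   -e 's/&#47;/\//g'  -e 's/&#58;/:/g' \
        mail.txt \
    | grep -i -f urllist.txt > /dev/null && echo "spam hit"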

One of Kee's objections was that it would be a resource-demanding filter. After studying the issue further, my only concern is increased use of MIME encoding. The tests I have done with sed-based character decoding and nawk-based "<!--" comment-stripping filters have resulted in minor resource demands that most sites, except the large ISPs, will have no problem with. Comment coding is mainly a problem for phone number matching, not for URLs, which it breaks. Now, neither sed nor nawk is the best tool for this, so results should be dramatically improved by a purpose-designed filter compared to my present solution, which is just a shell-script test filter.
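
The comment-stripping part is the simpler of the two; in nawk it is essentially just this (again a sketch with example file names, assuming the comment opens and closes on the same line):

    # remove HTML comments inserted to break up phone numbers, e.g. 555<!-- x -->1234
    nawk '{ gsub(/<!--[^>]*-->/, ""); print }' mail.txt | grep -f phonelist.txt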

But how does my list of 7,000 URLs and numbers work? Well, I have now relegated my old mail account to a test account, protected by a basic shell script: the mail is filtered and matched against the list by a simple grep command. The list also includes the names/IP numbers of some 2,000 open mail relays. Yesterday I received the first two spams into that box in two months (it gets ca 10 spams a day). I also send my present mailbox over to it to check for false positives, and after a clean-out of some 1,000 old mail relays in mid-May, I have had no false positives at all. Given that the test script does not yet have any more advanced features such as MIME decoding, that it greps on the whole text, does not look for broken lines in the URLs and so on, I regard this as a fairly good result.
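
For the curious, the guts of that shell script are not much more than this (a stripped-down sketch; blocklist.txt stands in for my combined URL/phone/relay list and the mailbox paths are only examples):

    #!/bin/sh
    # one message on stdin: file it into the spambox on a hit,
    # otherwise into the normal mailbox
    msg=/tmp/msg.$$
    cat > "$msg"
    if grep -iq -f /etc/mail/blocklist.txt "$msg"; then
        cat "$msg" >> "$HOME/Mail/spambox"
    else
        cat "$msg" >> "$HOME/Mail/inbox"
    fi
    rm -f "$msg"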

As an example, when I went from my own 1,500 URLs and 3,000 relays to the 5,300+2,000 including Spamhaus, I went from 1 spam in 10 getting through to 1 in 300.

Another issue Kee brought up, which I had no good solution to then, was how to keep the target list up to date. There are a number of actions needed to handle this:

1. There will always be a certain amount of spam sent directly to a targeted domain. If there internally exists a mailbox for forwarding spam (larger corporations/sites might want a number of addresses so as not to make any one address too widely known), with a spam processor that extracts the URLs and phone numbers, the local users can forward their spam there, creating a locally adapted block list. As my own results show, this method catches most of the addresses I will encounter over a longer period, 9 out of 10. (OK, this part might be a problem due to Jacob Nielsen's patent.)

2. Due to the size of the problem, it could be regarded as prudent for the US FTC and its EU counterpart to run honeypots, where spam is collected in the same way as for the local spambox, to make official lists that can be merged with the local ones. With a large number of addresses over a number of domains (though the same end mailbox), distributed on "fake" web pages and in Usenet, these honeypots should gain momentum quickly, producing lists that from the start catch ca 80% of the spam. Together with the local lists, I estimate that these two would account for blocking over 95% of the possible spam to an individual site within a couple of weeks.

3. Anyone interested could set up their own honeypots and publish their own lists. These will increase the hit rate further. Oh, these could be tainted if discovered, yes; that was another of Kee's original objections. But my answer is: what is the chance that a spammer would include a URL that my Aunt Agatha would ever think of sending me? Probably nil. They can't win this one, since it only affects mail containing just those URLs/numbers, never the web. If we know that msn.com always gets filtered out because of the spammers, well, why send that address as a link? A nuisance, yes, but we have adopted short codes for other purposes. The only thing the spammer achieves is that some lists get much bigger than the rest and thereby become suspicious.

4. With a Bayesian filter on the URLs, the filter could learn to catch related new URLs without updating the spam list. Due to the smaller amount of data to analyze, it is also faster. With so many related sites, this would help a lot.

5. When we locally join the existing list with an external master, it is simply a "sort | uniq" command: a cron job that fetches the "master" file with one of the batch web "browsers" and does the joining (see the sketch after this list). Updating the data is therefore simple.

6. If you know that you have absolutely no Asian users, make the spam filter look for characters such as "訝皹歙". If it finds a number of them (look at combinations typical for your Asian spammers), you have a 100% hit rate. I haven't seen an Asian mail in my test box since February, when I remade my test filter to dump such letters, even though on average they make up 40% of the incoming spam.

7. If you don't accept someone else's blocking of a certain spammer, you simply remove that entry from the list (OK, only on a per-site basis, for simplicity).

8. It is up to the taste of the MTA/UA designers whether they just want to throw away the spam or put it in a separate mailbox (IMAP/Maildir?) for visual inspection of both false positives and interesting spams, allow spam with the "ADV:" flag in the subject, or whatever. And it is up to us to choose the MTA/UA. Keep a time limit on stored spam, maybe nothing older than, say, two weeks; I run my test this way today. And there is no one for the spammers to sue (a telltale sign that they are being pushed against the rail), because my company/organization has the exclusive right to decide what our resources are used for, not the spammers. I decide how we use a retrieved spam list, not those who collected it.
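
The list merging mentioned in point 5 above really is no more than this (a sketch; wget stands in for whatever batch web "browser" you prefer, and the URL and file paths are examples only):

    # nightly cron job: fetch the external master list and merge it with the local one
    wget -q -O /tmp/master.txt http://example.org/spam-url-master.txt
    sort /tmp/master.txt /etc/mail/blocklist.txt | uniq > /etc/mail/blocklist.new
    mv /etc/mail/blocklist.new /etc/mail/blocklist.txt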

The result should also be that as soon as a spammer's customer moves to a new phone number or URL to get away from a block, the new one gets immediately tainted by the spammer's own actions. In the end this makes the process not economically viable, changing URLs or numbers (i.e. the call-center operator) for every spam, particularly as the process does not care about sub-domains, which is the most common way of changing a URL today; some have 10-15 sub-domains to one main domain. Before I filtered those out I had some 15,000 unique URLs.
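
Collapsing the sub-domains down to their main domains, which is what took my raw list from some 15,000 entries to 7,000, can be done with a small awk script of this kind (a rough sketch that assumes the list contains bare host names and only special-cases a few two-part TLDs; a real filter needs a proper suffix table):

    # keep only the last two labels of each host name
    # (three for co.xx style TLDs)
    awk -F. '{
        n = NF
        if (n >= 3 && $(n-1) ~ /^(co|com|net|org|ac)$/)
            print $(n-2) "." $(n-1) "." $n
        else if (n >= 2)
            print $(n-1) "." $n
        else
            print
    }' urllist.txt | sort | uniq > mainlist.txt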

As I said before, the most visible threat I see is that they all go over to base64-encoded mails, but then I tell my counterparts: "do not send me base64 mails, because I'll dump them; use ftp or a web address for me to retrieve any attachment". So that is also a possible filter tactic. On the other hand, someone on the list with the knowledge and time might be able to make a simple and fast streaming filter that works far better than deview or metamail in a spam-filter environment. Another threat is pure hooliganism; there Bayesian filters are the only option.

Kurt Magnusson




