
Re: [Asrg] Hotmail used for scripted spam (from SlashDot)

2003-06-11 12:37:04
Yakov Shafranovich wrote:
......
Note, that a CRI protocol will not help here since the HotMail C/R system will authenticate the sender. Only active involvement by the ISP and limiting amount of outgoing email will help. RMX will not resolve this issue either.
.....

Hmm, very quiet on this one. Are we then back to square one? What will this mean for other web mail vendors? There are a number of Perl/PHP scripts out there that batch-connect to retrieve mail from Hotmail, Lycos and others, and they are modifiable to send instead. I already thought that C/R systems were a bit of a colossus on clay feet, but most of the discussion during May/June seems to have been about which C/R method is best, not so much about whether it is practical at all, or about methods simpler than C/R and RMX, which force us to make more drastic changes to our infrastructure.

About two months ago, I presented a complementary methodology for spam detection that I call the "Earnest" method, because it is based on the only earnest data that exists in spam: URLs and phone numbers.

At that time Kee Hinckley had some serious objections to my ideas, so I spent some time making a more thorough survey, adding to my original 6,000 spams of my own some 40,000 from the Spamhaus archives and another 15,000 from a guy active in the European RIPE spam discussion list, i.e. a total of about 60,000 spams. Hopefully I now have a more statistically significant amount of spam with which to revisit my idea.

For background: from my own spams I extracted 1,500 unique URLs and phone numbers, from Spamhaus another 5,000 and from RIPE the last 2,500, on the base pattern something.com, something.biz, something.co.tw and so on. After sorting my own and the Spamhaus data together, 1,200 were common to both, giving a total of 5,300 unique entries. After adding the RIPE data, another 800 were common, giving a total of ca 7,000 unique URLs and numbers. When visually reviewing them, I also found that many were so alike in naming structure that they are obviously related, and that the data probably represents no more than 3,000-4,000 real web sites.
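
For anyone who wants to try the extraction themselves, it is roughly this kind of one-liner (a rough sketch only, assuming GNU grep; the TLD list and the file names are just examples, not my actual script):

    # pull domain-looking strings out of a spam corpus and deduplicate them
    # (spam-corpus/ and domains.txt are placeholder names)
    grep -hEio '[a-z0-9][a-z0-9.-]+\.(com|net|org|biz|info|co\.[a-z][a-z])' spam-corpus/* \
        | tr 'A-Z' 'a-z' \
        | sort | uniq > domains.txt
    wc -l domains.txt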

My conclusion is that even if we deal with hundreds of millions of spams, we, going by Spamhaus' estimate of the number of really active spammers (200+), only deal with a limited number of domains and phone numbers, maybe fewer than 100,000. So if we concentrate the effort, as I laid out in the Earnest method presentation, on extracting that _real_ data, the filtering task becomes easier and faster and requires fewer resources.

As I said last time, the key point is that if you send a spam, you, as a spammer, need to make it easy for the user to go to the site so you get your dues, i.e. supply a working URL or a readable phone number. If you encode that URL in any way other than encoding the whole page with base64 (a quite possible spammer solution), e.g. as "%xx" or "&#xxx;" web character codes, you force your target to type in the whole URL, which leads to a drastically reduced hit rate for the spammer's customer. Simplicity is the main driver behind spam. If the process doesn't produce, the customers will not use spam; a simple economic maxim IT people often tend to forget.

And as far as I have found in my tests, you cannot char-encode either the "a href" or the "http://" part of a URL, because the link will then not work. Therefore these parts will always be in clear text and greppable. If those lines are decoded for any char encoding of the addresses and compared in a case-insensitive, free-text search, they can't hide: if their URL is there, you get a hit. And if they try to put in more URLs, they still don't get away. They could try to disguise the address with a line break inside it, but if the filter checks for breaks before the closing ">", that can be handled. This is simpler than analyzing the whole letter.
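
To give an idea of what I mean, the decoding step before the matching could look roughly like this (a simplified sketch of a sed-based filter, not my real one; it only handles a few of the most common codes, and mail.txt / urllist.txt are placeholder names):

    # undo the most common char encodings of "." "/" ":" before matching
    # against the URL list; a real filter needs the full %xx and &#xxx; tables
    sed -e 's/%2[Ee]/./g'  -e 's/%2[Ff]/\//g' -e 's/%3[Aa]/:/g' \
        -e 's/&#46;/./g'   -e 's/&#47;/\//g'  -e 's/&#58;/:/g' \
        mail.txt \
    | grep -i -f urllist.txt > /dev/null && echo "spam hit"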

One of Kee's objections was that it would be a resource-demanding filter. After studying the issue further, my only concern is increased use of MIME encoding. The tests I have done with sed-based character decoding and nawk-based "<!--" comment-stripping filters have resulted in minor resource demands that most sites, except the large ISPs, will have no problem with. Comment coding is mainly a problem for phone number matching, not for URLs, which it breaks. Now, neither sed nor nawk is the best tool for this, so results should be dramatically improved by a purpose-designed filter compared to my present solution, which is just a shell-script test filter.
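
The comment-stripping part is the simpler of the two; in nawk it is essentially just this (again a sketch with example file names, assuming the comment opens and closes on the same line):

    # remove HTML comments inserted to break up phone numbers, e.g. 555<!-- x -->1234
    nawk '{ gsub(/<!--[^>]*-->/, ""); print }' mail.txt | grep -f phonelist.txt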

But how does my list of 7,000 URLs and numbers work? Well, I have now relegated my old mail account to a test account, protected by a basic shell script: the mail is filtered and matched against the list by a simple grep command. The list also includes the names/IP numbers of some 2,000 open mail relays. Yesterday I received the first two spams into that box in two months (it gets ca 10 spams a day). I also send my present mailbox over to it to check for false positives, and after a clean-out of some 1,000 old mail relays in mid-May, I have had no false positives at all. Given that the test script does not yet have any more advanced features such as MIME decoding, that it greps on the whole text, does not look for broken lines in the URLs and so on, I regard this as a fairly good result.
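
For the curious, the guts of that shell script are not much more than this (a stripped-down sketch; blocklist.txt stands in for my combined URL/phone/relay list and the mailbox paths are only examples):

    #!/bin/sh
    # one message on stdin: file it into the spambox on a hit,
    # otherwise into the normal mailbox
    msg=/tmp/msg.$$
    cat > "$msg"
    if grep -iq -f /etc/mail/blocklist.txt "$msg"; then
        cat "$msg" >> "$HOME/Mail/spambox"
    else
        cat "$msg" >> "$HOME/Mail/inbox"
    fi
    rm -f "$msg"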

As an example, when I went from my own 1,500 URLs and 3,000 relays to the 5,300+2,000 including Spamhaus, I went from 1 spam in 10 getting through to 1 in 300.

Another issue Kee brought up, which I had no good solution to then, was how to keep the target list up to date. There are a number of actions needed to handle this:

1. There will always be a certain amount of spam sent directly to a targeted domain. If there internally exists a mailbox for forwarding spam (larger corporations/sites might want a number of addresses so as not to make any one address too widely known), with a spam processor that extracts the URLs and phone numbers, the local users can forward their spam there, creating a locally adapted block list. As my own results show, this method catches most of the addresses I will encounter over a longer period, 9 out of 10. (OK, this part might be a problem due to Jacob Nielsen's patent.)

2. Due to the size of the problem, it could be regarded as prudent for the US FTC and its EU counterpart to run honeypots, where spam is collected in the same way as for the local spambox, to make official lists that can be merged with the local ones. With a large number of addresses over a number of domains (though the same end mailbox), distributed on "fake" web pages and in Usenet, these honeypots should gain momentum quickly, producing lists that from the start catch ca 80% of the spam. Together with the local lists, I estimate that these two would account for blocking over 95% of the possible spam to an individual site within a couple of weeks.

3. Anyone interested could set up their own honeypots and publish their own lists. These will increase the hit rate further. Oh, these could be tainted if discovered, yes; that was another of Kee's original objections. But my answer is: what is the chance that a spammer would include a URL that my Aunt Agatha would ever think of sending me? Probably nil. They can't win this one, since it only affects mail containing just those URLs/numbers, never the web. If we know that msn.com always gets filtered out because of the spammers, well, why send that address as a link? A nuisance, yes, but we have adopted short codes for other purposes. The only thing the spammer achieves is that some lists get much bigger than the rest and thereby become suspicious.

4. With a Bayesian filter on the URLs, the filter could learn to catch related new URLs without updating the spam list. Due to the smaller amount of data to analyze, it is also faster. With so many related sites, this would help a lot.

5. When we locally join the existing list with an external master, it is simply a "sort | uniq" command: a cron job that fetches the "master" file with one of the batch web "browsers" and does the joining (see the sketch after this list). Updating the data is therefore simple.

6. If you know that you have absolutely no Asian users, make the spam filter look for characters such as "訝皹歙". If it finds a number of them (look at combinations typical for your Asian spammers), you have a 100% hit rate. I haven't seen an Asian mail in my test box since February, when I remade my test filter to dump such letters, even though on average they make up 40% of the incoming spam.

7. If you don't accept someone else's blocking of a certain spammer, you simply remove that entry from the list (OK, only on a per-site basis, for simplicity).

8. It is up to the taste of the MTA/UA designers whether they just want to throw away the spam or put it in a separate mailbox (IMAP/Maildir?) for visual inspection of both false positives and interesting spams, allow spam with the "ADV:" flag in the subject, or whatever. And it is up to us to choose the MTA/UA. Keep a time limit on stored spam, maybe nothing older than, say, two weeks; I run my test this way today. And there is no one for the spammers to sue (a telltale sign that they are being pushed against the rail), because my company/organization has the exclusive right to decide what our resources are used for, not the spammers. I decide how we use a retrieved spam list, not those who collected it.
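
The list merging mentioned in point 5 above really is no more than this (a sketch; wget stands in for whatever batch web "browser" you prefer, and the URL and file paths are examples only):

    # nightly cron job: fetch the external master list and merge it with the local one
    wget -q -O /tmp/master.txt http://example.org/spam-url-master.txt
    sort /tmp/master.txt /etc/mail/blocklist.txt | uniq > /etc/mail/blocklist.new
    mv /etc/mail/blocklist.new /etc/mail/blocklist.txt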

The result should also be that as soon as a spammer's customer moves to a new phone number or URL to get away from a block, the new one gets immediately tainted by the spammer's own actions. In the end this makes the process not economically viable, changing URLs or numbers (i.e. the call-center operator) for every spam, particularly as the process does not care about sub-domains, which is the most common way of changing a URL today; some have 10-15 sub-domains to one main domain. Before I filtered those out I had some 15,000 unique URLs.
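
Collapsing the sub-domains down to their main domains, which is what took my raw list from some 15,000 entries to 7,000, can be done with a small awk script of this kind (a rough sketch that assumes the list contains bare host names and only special-cases a few two-part TLDs; a real filter needs a proper suffix table):

    # keep only the last two labels of each host name
    # (three for co.xx style TLDs)
    awk -F. '{
        n = NF
        if (n >= 3 && $(n-1) ~ /^(co|com|net|org|ac)$/)
            print $(n-2) "." $(n-1) "." $n
        else if (n >= 2)
            print $(n-1) "." $n
        else
            print
    }' urllist.txt | sort | uniq > mainlist.txt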

As I said before, the most visible threat I see is that they all go over to base64-encoded mails, but then I tell my counterparts: "do not send me base64 mails, because I'll dump them; use ftp or a web address for me to retrieve any attachment". So that is also a possible filter tactic. On the other hand, someone on the list with the knowledge and time might be able to make a simple and fast streaming filter that works far better than deview or metamail in a spam-filter environment. Another threat is pure hooliganism; there Bayesian filters are the only option.

Kurt Magnusson




