Re: [Asrg] Hotmail used for scripted spam (from SlashDot)
2003-06-11 12:37:04
Yakov Shafranovich wrote:
......
Note, that a CRI protocol will not help here since the HotMail C/R system
will authenticate the sender. Only active involvement by the ISP and
limiting amount of outgoing email will help. RMX will not resolve this
issue either.
.....
Hmm, very quiet on this one. Are we then back to square one? What will this
mean for other web mail vendors? There are already a number of Perl/PHP
scripts out there that batch-connect to retrieve mail from Hotmail, Lycos and
others, and they could be modified to send instead. I already thought that
C/R systems are a bit of a colossus on clay feet, but most of the discussion
during May/June seems to have been about which C/R method is best, not so
much about whether C/R is practical at all, or about simpler alternatives to
C/R and RMX, which force us to make more drastic changes to our
infrastructure.
About two months ago I presented a complementary methodology for spam
detection that I call the "Earnest" method, because it is based on the only
earnest data that exists in a spam: the URLs and phone numbers.
At that time Kee Hinckley raised some serious objections to my ideas, so I
have spent some time on a more thorough survey, extending the data from my
original 6,000 spams of my own with some 40,000 from the Spamhaus archives
and another 15,000 from a guy active in the European RIPE anti-spam
discussion list, i.e. a total of about 60,000 spams. Hopefully I now have a
more statistically significant amount of spam with which to revisit the idea.
As background: from my own spams I extracted 1,500 unique URLs and phone
numbers, from Spamhaus another 5,000 and from RIPE the last 2,500, on the
pattern something.com, something.biz, something.co.tw and so on. After
sorting my own and the Spamhaus data together, 1,200 entries were common to
both, giving a total of 5,300 unique entries. After adding the RIPE data,
another 800 were common, giving a total of ca 7,000 unique URLs and numbers.
When reviewing them visually I also found that many were so alike in naming
structure that they are obviously related, and that the data probably
represent no more than 3,000-4,000 real web sites.
My conclusion is that even if we deal with hundreds of millions of spams, we,
as Spamhaus estimates with its count of really active spammers (200+), only
deal with a limited number of domains and phone numbers, maybe fewer than
100,000. So if we concentrate the effort, as I laid out in the Earnest method
presentation, on extracting that _real_ data, the filtering task becomes
easier and faster and requires fewer resources.
As I said last time, the key point is that if you send a spam, you, as a
spammer, need to make it easy for the user to go to the site so you get your
dues, i.e. supply a working URL or readable phone number. If you encode that
URL in any way other than encoding the whole page in base64 (a quite possible
spammer countermeasure), for example as "%xx" or "&#xxx;" web character
codes, you force your target to type in the whole URL by hand, which leads to
a drastically reduced hit rate for the spammer's customer. Simplicity is the
main driver behind spam. If the process doesn't produce, the customers will
not pay for spam; a simple economic maxim IT people often tend to forget.
And as far as I have found in my tests, you cannot character-code either the
"a href" or the "http://" part of a URL, because the link then stops working.
Therefore these parts will always be in clear text and greppable. If those
lines are decoded for any character encoding of the addresses and compared in
a case-insensitive free-text search, they can't hide: if their URL is there,
you get a hit. And if they try to put in more URLs, they still don't get
away. They could try to disguise the address with a line break inside it, but
if the filter checks for breaks before the closing ">", that can be handled.
This is simpler than analyzing the whole letter.
One of Kee's objections was that it would be a resource-demanding filter.
After studying the issue further, my only remaining concern is increased use
of MIME encoding; the tests I have done with sed-based character decoding and
nawk-based "<!--" comment-stripping filters have shown minor resource demands
that most sites, except perhaps the large ISPs, will have no problem with.
Comment coding is mainly a problem for phone number matching, not for URLs; a
comment inside a URL breaks the link. Neither sed nor nawk is the best tool
for this, so the result should be dramatically better with a purpose-designed
filter than with my present solution, which is just a shell-script test
filter.
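
To make the decoding pass concrete, here is a rough nawk sketch (it is only
an illustration, not my actual filter; the file name decode_url.awk and the
patterns are made up for the example). It turns "%xx" and "&#ddd;" escapes
back into plain characters and lower-cases the line, so that a
case-insensitive grep against the URL list gets a fair chance:

    # decode_url.awk - undo "%xx" and "&#ddd;" escapes so hidden
    # addresses become plain text again before the list matching.
    BEGIN {
        for (i = 0; i < 16; i++)
            hexval[substr("0123456789abcdef", i + 1, 1)] = i
    }
    {
        line = tolower($0); out = ""
        while (match(line, /%[0-9a-f][0-9a-f]|&#[0-9]+;/) > 0) {
            out = out substr(line, 1, RSTART - 1)
            tok = substr(line, RSTART, RLENGTH)
            if (substr(tok, 1, 1) == "%")
                c = hexval[substr(tok, 2, 1)] * 16 + hexval[substr(tok, 3, 1)]
            else
                c = substr(tok, 3, length(tok) - 3) + 0
            out = out sprintf("%c", c)
            line = substr(line, RSTART + RLENGTH)
        }
        print out line
    }

Run as "nawk -f decode_url.awk < message | grep -i -f spamlist" it gives a
hit as soon as a listed URL or number appears, encoded or not.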
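For the comment trick, the simple same-line case can be handled with a sed
substitution before the number matching; something like this (file and list
names are invented for the example, and comments that span lines would need
the nawk variant mentioned above):

    # strip "<!-- ... -->" comments, throw away the usual digit
    # separators, then look for listed phone numbers (the list is
    # assumed to be stored as bare digit strings).
    sed -e 's/<!--[^>]*-->//g' message.txt |
        tr -d ' .()-' |
        grep -f phone-list && echo "listed phone number found"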
But how does my list of 7,000 URLs and numbers work in practice? Well, I have
relegated my old mail account to a test account, protected by a basic shell
script: mail is filtered and matched against the list by a simple grep
command. The list also includes the names/IP numbers of some 2,000 open mail
relays. Yesterday I received the first two spams into that box in two months
(it gets ca 10 spams a day). I also sent my present mailbox over to it to
check for false positives, and after cleaning out some 1,000 old mail relays
in mid-May, I had no false positives at all. Given that the test script does
not yet have any more advanced features such as MIME decoding, that it greps
on the whole text and does not look for broken lines in the URLs and so on, I
regard this as a fairly good result.
As an example, when I went from my 1,500 URLs and 3,000 relays to the
5,300+2,000 including Spamhaus, the box went from 1 spam in 10 delivered to
1 spam in 300.
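
For those who wonder what the test setup looks like, it is not more advanced
than something like this (paths and file names are again just examples; the
real script differs in detail but not in spirit):

    #!/bin/sh
    # delivery-time test filter: decode, grep against the combined
    # URL/phone/relay list, and sort the mail into inbox or spambox.
    LIST=/usr/local/etc/spamlist
    MSG=/tmp/msg.$$
    cat > "$MSG"                      # message arrives on stdin

    if nawk -f decode_url.awk < "$MSG" | grep -i -q -f "$LIST"; then
        cat "$MSG" >> "$HOME/mail/spambox"   # kept for inspection
    else
        cat "$MSG" >> "$HOME/mail/inbox"
    fi
    rm -f "$MSG"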
Another issue Kee brought up, which I had no good solution to at the time,
was how to keep the target list up to date. There are a number of actions
needed to handle this issue:
1. There will always be a certain amount of spam sent directly to any
targeted domain. Set up an internal mailbox for forwarding of spam (larger
corporations/sites might want several addresses, so as not to make any one
address too widely known), with a spam processor behind it that extracts the
URLs and phone numbers (a sketch of such an extractor follows right after
this list). The local users can forward their spam there, creating a locally
adapted block list. As my own results show, this method catches most of the
addresses I will encounter over a longer period, 9 out of 10. (OK, this part
might be a problem due to Jacob Nielsen's patent.)
2. Given the size of the problem, it could be regarded as prudent for the US
FTC and its EU counterpart to run honeypots, where spam is collected in the
same way as for the local spambox, to produce official lists that can be
merged with the local ones. With a large number of addresses spread over a
number of domains (all ending in the same mailbox), distributed on "fake" web
pages and in Usenet, these honeypots should gain momentum quickly, producing
lists that from the start catch ca 80% of the spam. Together with the local
lists, I estimate that these two would account for blocking over 95% of the
possible spam to an individual site within a couple of weeks.
3. Anyone interested could set up their own honeypots and publish their own
lists. These would increase the hit rate further. Oh, these lists could be
tainted if discovered, yes; that was another of Kee's original objections.
But my answer is: what is the chance that a spammer would include a URL that
my Aunt Agatha would ever think of sending me? Probably nil. They can't win
this one, since the blocking only affects mail containing exactly those
URLs/numbers, never the web itself. If we know that msn.com always gets
filtered out because of the spammers, well, why send that address as a link
at all? A nuisance, yes, but we have adopted short codes for other things.
The only thing the spammer achieves is that some lists will grow much bigger
than the rest and thereby become suspicious.
4. With a Bayesian filter applied to the URLs, a filter could learn to catch
related new URLs without the spam list being updated. Because there is much
less data to analyze than in a whole message, it is also faster. With so many
related sites, this would help a lot.
5. Joining the existing local list with an external master list is simply a
"sort | uniq" command: a cron job fetches the "master" file with one of the
batch web "browsers" and does the joining (see the second sketch after the
list). Keeping the data up to date is therefore simple.
6. If you know that you have absolutely no Asian users, make the spam filter
look for characters such as "訝皹歙". If it finds a number of them (look at
combinations typical of your Asian spammers), you have a 100% success rate;
the third sketch after the list shows a crude variant. I haven't seen an
Asian mail in my test box since February, when I changed my test filter to
dump such letters, even though on average they make up 40% of the incoming
spam.
7. If you do not accept that a certain spammer is blocked, you simply remove
that entry from the list (OK, only on a per-site basis, for simplicity).
8. It is up to the taste of the MTA/UA designers whether they just want to
throw the spam away or put it in a separate mailbox (IMAP/Maildir?) for
visual inspection of both false positives and interesting spams, allow spam
flagged with "ADV:" in the subject, or whatever. And it is up to us to choose
our MTA/UA. Keep a time limit on stored spam, maybe nothing older than, say,
two weeks; that is how I run my test today. And there is no one for the
spammers to sue (a telltale sign that they are being pushed against the
rail), because my company/organization has the exclusive right to decide what
our resources are used for, not the spammers. I decide how we use a retrieved
spam list, not those who collected it.
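
To illustrate point 1 above, the spam processor behind the forwarding mailbox
does not need to be fancy. Something along these lines would do as a first
cut (the spool path, list path and the patterns are invented for the
example):

    #!/bin/sh
    # pull URLs and phone-number-like strings out of forwarded spam
    # and fold them into the local block list.
    LIST=/usr/local/etc/spamlist
    SPOOL=/var/mail/spamtrap

    nawk -f decode_url.awk < "$SPOOL" |
        sed -n 's|.*\(http://[a-z0-9.-]*\).*|\1|p' |
        sed -e 's|^http://||' -e 's|^www\.||' > /tmp/new.$$

    sed -n 's|.*\([0-9][0-9 ()./-]\{6,\}[0-9]\).*|\1|p' "$SPOOL" |
        tr -d ' ()./-' >> /tmp/new.$$

    sort -u "$LIST" /tmp/new.$$ > "$LIST.new" && mv "$LIST.new" "$LIST"
    rm -f /tmp/new.$$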
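The merge in point 5 is equally trivial; the whole "keeping it up to date"
machinery can be a nightly cron job like this (the master URL is of course
fictitious):

    #!/bin/sh
    # fetch a published master list and fold it into the local one
    LIST=/usr/local/etc/spamlist
    TMP=/tmp/master.$$
    wget -q -O "$TMP" http://master.example.org/spamlist.txt || exit 1
    sort -u "$LIST" "$TMP" > "$LIST.new" && mv "$LIST.new" "$LIST"
    rm -f "$TMP"

    # crontab entry, e.g.:  15 4 * * *  /usr/local/sbin/merge-spamlist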
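And the crude variant of point 6: if you have no Asian-language users at all,
a check on the declared character set, or simply on the amount of 8-bit
characters, is enough (the charsets and the threshold below are only
examples):

    #!/bin/sh
    # exit 1 ("spam") for mails in typical east-Asian charsets or
    # with a suspicious amount of 8-bit characters
    MSG=$1
    if grep -i -E -q 'charset="?(gb2312|big5|euc-kr|ks_c_5601)' "$MSG"; then
        exit 1
    fi
    HIGH=`tr -cd '\200-\377' < "$MSG" | wc -c`
    [ "$HIGH" -gt 200 ] && exit 1      # threshold picked out of thin air
    exit 0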
The result should also be that as soon as a spammer's customer moves to a new
phone number or URL to get around a block, that new one immediately gets
tainted by the spammer's own actions. In the end this makes the process
economically unviable: changing URLs or numbers (i.e. call centre operators)
for every spam run, particularly as the process does not care about
sub-domains, which is the most common way of changing URLs today; some main
domains have 10-15 sub-domains each. Before I filtered those out I had some
15,000 unique URLs.
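
The sub-domain collapsing mentioned above is also cheap to do. A rough rule
(and it is only a rough rule; country domains like .co.tw need the extra
label) is to keep the last two labels of the host name before the entry goes
into the list:

    # collapse promo1.spamsite.com, promo2.spamsite.com, ... to spamsite.com
    nawk -F'.' '{
        n = NF
        if (n >= 3 && length($n) == 2 && length($(n - 1)) <= 3)
            print $(n - 2) "." $(n - 1) "." $n   # e.g. something.co.tw
        else if (n >= 2)
            print $(n - 1) "." $n                # e.g. spamsite.com
        else
            print
    }' hostnames.txt | sort -u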
As I said before, the most visible threat I see is that they all move over to
base64-encoded mails, but then I tell my counterparts: "do not send me base64
mails, because I will dump them; use FTP or a web address for me to retrieve
any attachment". So that is also a possible filter tactic. On the other hand,
someone on this list with the knowledge and time might be able to make a
simple and fast streaming filter that works far better than deview or
metamail in a spam-filter environment. Another threat is pure hooliganism,
where Bayesian filters are the only option.
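
If someone wants to act on the base64 threat already today, the crudest
possible test is on the transfer-encoding header; it will of course also hit
legitimate attachments, which is exactly why I tell people to use FTP or the
web for those (MSG and the mailbox path are just examples):

    #!/bin/sh
    # park anything that announces a base64 body in the spambox
    MSG=$1
    if grep -i -q '^Content-Transfer-Encoding:[[:space:]]*base64' "$MSG"; then
        cat "$MSG" >> "$HOME/mail/spambox"    # or reject it outright
    fi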
Kurt Magnusson