Re: [Asrg] 2.a.1 Analysis of Actual Spam Data - next steps (reflection)
2003-08-26 14:51:31
Since we don't yet have any ASRG data, I would like to share some tidbits
from processing some 120,000 mails for my data file.
Since I concentrate on finding valid contact info in the mails, I extract
URLs and phone numbers. In May I made some reference to my findings, but
when I added ca. 50,000 mails from spamarchive.org last week to extend my
data list, I got some unexpected results.
In short, in March/April I processed ca. 5,000-6,000 mails I had received
myself, giving me some 3,000 unique URLs (I only keep the least common
denominator of a domain). In May I filtered some 50,000 mails from
spamarchive.org to increase the data, which gave me an additional ca. 8,000
URLs/phone numbers, and some time later another 15,000 mails from a Dutch
guy on a RIPE.net discussion. These added about another 5,000-6,000
URLs/numbers.
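
The extraction step described above could be sketched roughly as follows.
This is a minimal Python illustration, not the tool actually used here: it
assumes that "least common denominator of a domain" means the last two host
labels, it leaves out phone numbers, and the names URL_RE, base_domain and
extract_domains are made up for the example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern: anything that looks like an http(s) URL.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def base_domain(url):
    """Reduce a URL to its last two host labels.

    Assumption: this is what "least common denominator" means. It is wrong
    for suffixes such as .co.uk and meaningless for raw IP addresses.
    """
    try:
        host = urlsplit(url.rstrip(".,;:!?)")).hostname or ""
    except ValueError:  # malformed URL, e.g. unbalanced IPv6 brackets
        return None
    labels = [part for part in host.lower().split(".") if part]
    if len(labels) < 2:
        return None
    return ".".join(labels[-2:])


def extract_domains(text):
    """Return the set of unique base domains found in one mail body."""
    found = set()
    for url in URL_RE.findall(text):
        domain = base_domain(url)
        if domain:
            found.add(domain)
    return found
```

Collecting the results in a set does the "sort and remove duplicates" step
implicitly, so many mails pointing at the same site count once.
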
In these collections, ca. 20% of the mails yielded previously unknown
addresses, so when I set out last week to catch up with new domains, I again
expected to get ca. 20% from the 50,000 new mails. But not so. After sorting
and removing duplicates, I only got ca. 2,500 new addresses, nearly 5,000
fewer than in May from the same amount of data.
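
To make the comparison concrete, here is a small Python sketch of the "new
addresses per batch" bookkeeping. The function names are hypothetical, and
the numbers plugged in at the end are only the approximate figures quoted
above (ca. 8,000 and ca. 2,500 new addresses from 50,000 mails each).

```python
def new_addresses(known, batch):
    """Addresses seen in this batch that are not already in the known list."""
    return set(batch) - set(known)


def yield_per_mail(new_count, mails_processed):
    """Fraction of new unique addresses per processed mail."""
    return new_count / mails_processed


# Approximate figures from this post, not exact counts.
may_yield = yield_per_mail(8000, 50000)        # May batch
august_yield = yield_per_mail(2500, 50000)     # last week's batch
print(f"May: {may_yield:.0%}, last week: {august_yield:.0%}")
```

With these rounded inputs the yield drops from 16% to 5% on the same number
of mails, which is the gap discussed in this post.
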
Yes, one may object, in May they were all new, and now it is only 3-4 months
later and much of that spam is still around!
True, where Spamarchive gets its data can taint the result, but after
looking at it, my conclusion is this: if there really were a very large
number of domains/numbers in use, I would have got at least about half to
two thirds of the May batch's count, because with a significantly large
number of domains I should still have missed a lot in May, which would have
raised last week's figures. But it didn't.
On one occasion here during the summer, I said that even if the number of
spams runs into the millions, the number of operators/domains is far
smaller. With some 120,000 mails I should have some statistical relevance,
and I still want to press the point that we probably only have some
30,000-50,000 active spam domains/call center numbers, which makes the
blocking task easier if we use the right data.
Another reflection: the ISP for the account where I have the most problems
installed SpamAssassin in some Bayesian mode during the Sobig attack last
week. Due to the number of users and the volume, it worked hard, getting
visibly more effective during the weekend. Still, simply grepping incoming
mails for the URLs in my block list is ca. 30% more effective than
SpamAssassin, and SpamAssassin also has more false positives. For those
interested in the domain/number data, you can find it at:
http://hem.passagen.se/kmn_asrg/spamlist.txt.
Note: Sun's implementation(s) of grep have some funny behaviours, so I had
to protect (escape) the dots in the URLs. I also found out that grep is
guilty of some of my false positives (1 in 200), since it matches domains in
the list against words in the mail that are not domains. None of the
egrep/fgrep alternatives helps, and Sun is my only environment, so if
someone has a clue, I would appreciate a hint. Also, some addresses I never
decoded, since my proof-of-concept filter doesn't have any decoding filters;
only my URL/phone extraction tool has that today.
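
One possible way around the word-matching false positives, sketched in
Python rather than grep: escape the dots and only accept a block-list entry
when it is not glued to other host-name characters. This assumes the false
positives come from entries matching inside longer words, which has not been
verified here; build_matcher and is_blocked are hypothetical names.

```python
import re


def build_matcher(domains):
    """Compile one regex that matches any listed domain as a whole host name."""
    if not domains:
        return re.compile(r"(?!)")  # pattern that never matches
    # re.escape protects the dots, so "." only matches a literal dot.
    alternation = "|".join(
        re.escape(d.lower()) for d in sorted(domains, key=len, reverse=True)
    )
    # Left side: must not be preceded by a letter, digit or hyphen
    # (a preceding "." or "/" is fine, so www.example.com still matches).
    # Right side: must not continue with host characters or another label.
    pattern = r"(?<![a-z0-9-])(?:" + alternation + r")(?![a-z0-9-]|\.[a-z0-9])"
    return re.compile(pattern, re.IGNORECASE)


def is_blocked(text, matcher):
    """True if any block-list domain appears in the text as a host name."""
    return matcher.search(text) is not None
```

For example, with the list entry pills.com this accepts
http://www.pills.com/buy but rejects nopills.com and pillsxcom. It does
nothing about encoded addresses, which still need the decoding step.
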
Kurt Magnusson
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg