ietf-asrg

Re: [Asrg] 2.a.1 Analysis of Actual Spam Data - next steps (reflection)

2003-08-26 14:51:31

Since we don't yet have any ASRG data, I would like to share some tidbits from the results I got while processing some ca 120.000 mails for my data file.

Since I concentrate on finding valid contact info in the mails, I extract URLs and phone numbers. In May I made some reference to my findings, but when I added ca 50.000 mails from spamarchive.org last week to extend my data list, I got some unexpected results.

In short, in March/April I processed ca 5.000-6.000 mails I had received myself, giving me some 3.000 unique URLs (I only keep the least common denominator of a domain). In May I filtered some 50.000 mails from spamarchive.org to increase the data, which gave me an additional ca 8.000 URLs/phone numbers, and some time later another 15.000 mails from a Dutch guy on a RIPE.net discussion. This added another ca 5.000-6.000 URLs/numbers.
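As an illustration of this kind of extraction, here is a minimal Python sketch, not the tool actually used. It assumes that only http/https URLs matter and that "least common denominator" means the last two labels of the host name; that rule is a naive guess and is wrong for suffixes such as co.uk. The names URL_HOST_RE, base_domain and extract_domains are made up for the example.

```python
import re

# Assumption: only http/https URLs are of interest; the host part ends at the
# first slash, colon, whitespace, quote, or angle bracket.
URL_HOST_RE = re.compile(r"""https?://([^/\s"'<>:]+)""", re.IGNORECASE)


def base_domain(host):
    """Naively reduce a host name to its last two labels (hypothetical rule)."""
    labels = host.lower().strip(".").split(".")
    if all(label.isdigit() for label in labels):
        return ".".join(labels)  # dotted IP address: keep it whole
    return ".".join(labels[-2:])


def extract_domains(text):
    """Return the set of reduced domains for every URL found in text."""
    return {base_domain(m.group(1)) for m in URL_HOST_RE.finditer(text)}


if __name__ == "__main__":
    sample = "Visit http://www.Example.com/buy or https://shop.example.com:8080/x"
    print(sorted(extract_domains(sample)))
```

Because the result is a set, the same domain seen under several host names or in several mails is only counted once.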

These collections showed that ca 20 % of the letters contained unknown addresses, so when I caught up with new domains last week, I again expected to get ca 20 % from the 50.000 new mails. But not so. After sorting and removing duplicates, I only got ca 2.500 new addresses, nearly 5.000 fewer than in May from the same amount of data.
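The comparison behind these numbers can be sketched in a few lines of Python. This is only an illustration under the assumption that the known list and the new batch are both available as plain collections of already reduced addresses; the function name new_address_yield is made up for the example.

```python
def new_address_yield(known, batch_addresses, batch_mail_count):
    """Return (new_addresses, new_addresses_per_mail) for one batch.

    known            -- addresses already in the data file
    batch_addresses  -- addresses extracted from the new batch (may repeat)
    batch_mail_count -- number of mails in the batch
    """
    # Converting to sets removes duplicates; the difference keeps only
    # addresses that were not already known.
    new = set(batch_addresses) - set(known)
    return new, len(new) / batch_mail_count


if __name__ == "__main__":
    known = {"a.com", "b.com"}
    batch = ["a.com", "c.com", "c.com", "d.net"]
    new, rate = new_address_yield(known, batch, batch_mail_count=10)
    print(sorted(new), rate)
```

With the figures quoted here, ca 2.500 new addresses from 50.000 mails is ca 5 %, against ca 8.000 from 50.000 mails (ca 16 %) for the May batch from spamarchive.org.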

Yeah, but in May all were new; now it is only 3-4 months later and much of that spam is still around!

True, where Spamarchive gets its data can taint the result, but after looking at it, my conclusion is this: if there really were a very large number of domains/numbers in use, I should have got at least about half to two thirds of the May batch's count. With a significantly large number of domains I would still have missed a lot in May, and that would have raised last week's figures. But it didn't.

On one occasion here, during the summer, I said that even if the number of spams counts in the millions, the number of operators/domains is far smaller. With some 120.000 mails I should have some statistical relevance, and I still want to press the point that we probably only have some 30.000-50.000 active spam domains/call center numbers, which makes the blocking task easier, if we use the right data.

Another reflection: the ISP for the account where I have the most problems installed SpamAssassin in some Bayesian mode during the Sobig attack last week. Due to the number of users and the volume, it worked hard, getting visibly more effective during the weekend. Even so, simply grepping incoming mails for the URLs from my block list is still ca 30 % more effective than SpamAssassin, and SpamAssassin also has more false positives. For those interested in the domain/number data, you can find it at:

http://hem.passagen.se/kmn_asrg/spamlist.txt

Note: Sun's implementation(s) of grep have some funny behaviours, so I had to protect the dots in the URLs. I also found out that grep is guilty of some of my false positives (1 in 200), since it matches domains in the list against words in the mail that are not domains. None of the (e/f)grep alternatives helps and Sun is my only environment, so if someone has a clue, I would appreciate a hint. Also, some addresses I never decoded, since my proof-of-concept filter doesn't have any decoding filters; only my URL/phone extraction tool has that today.
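One possible way around both the dot problem and the word matches, sketched here in Python rather than grep and only as an assumption about what would help: escape every list entry so the dots are literal, and refuse a match that is glued to further letters, digits or hyphens on either side. The names build_matcher and is_listed are made up for the example, and the sketch does not decode encoded addresses.

```python
import re


def build_matcher(blocklist):
    """Compile one regex that matches any list entry as a whole name/number."""
    entries = sorted({e.strip().lower() for e in blocklist if e.strip()},
                     key=len, reverse=True)
    if not entries:
        return re.compile(r"(?!)")  # empty list: a pattern that never matches
    # re.escape makes the dots literal, so "example.com" cannot match
    # "exampleXcom". The lookarounds reject hits embedded in longer words,
    # e.g. "myexample.com" or "example.community".
    alternatives = "|".join(re.escape(e) for e in entries)
    return re.compile(r"(?<![a-z0-9-])(?:" + alternatives + r")(?![a-z0-9-])")


def is_listed(matcher, mail_text):
    """True if the mail text contains any listed domain or number."""
    return matcher.search(mail_text.lower()) is not None


if __name__ == "__main__":
    matcher = build_matcher(["example.com", "pills.biz"])
    print(is_listed(matcher, "Buy at http://www.Example.com/offer now"))
    print(is_listed(matcher, "a fine exampleXcom word"))
```

A listed domain still matches when it appears as a subdomain (www.example.com), since a dot before the entry is allowed. It would also match as the start of a longer name such as example.com.au, so some false positives can remain.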

Kurt Magnusson

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg


