Re: [Asrg] 2.a.1 Analysis of Actual Spam Data

Since we don't yet have any ASRG data, I'd like to share some tidbits from the results I got while processing ca 120.000 mails for my data file.

Since I concentrate on finding valid contact info in the mails, I extract URLs and phone numbers. In May I made some reference to some of my findings, but when I added ca 50.000 mails from spamarchive.org last week to extend my data list, I got some unexpected results.

In short, in March/April I processed ca 5.000-6.000 mails that I had received myself, giving me some 3.000 unique URLs (I only keep the least common denominator of a domain). In May I filtered some 50.000 mails from spamarchive.org to increase the data, which gave me an additional ca 8.000 URLs/phone numbers, and some time later another 15.000 mails from a Dutch guy on a RIPE.net discussion. This added another ca 5.000-6.000 URLs/numbers.
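
As a rough illustration of the extraction step described above, here is a minimal Python sketch. It is not the tool used for these figures: the regex, the function names, and the "keep the last two labels" rule are assumptions made for the example, standing in for whatever "least common denominator of a domain" means in the real tool.

```python
import re

# Assumed pattern: plain http(s) URLs only; captures just the host part.
URL_HOST = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)

def reduce_host(host):
    """Reduce a host name to its last two labels, lower-cased.

    This is an assumed reduction rule; it is too coarse for
    registries such as co.uk, where three labels would be needed.
    """
    labels = host.lower().strip(".").split(".")
    return ".".join(labels[-2:])

def extract_domains(mail_text):
    """Return the set of reduced domains found in URLs in one mail."""
    return {reduce_host(host) for host in URL_HOST.findall(mail_text)}
```

Collecting the results in a set means that the same domain seen in many mails is stored only once, which is what makes the unique counts above so much smaller than the mail counts.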

These collections showed that ca 20 % of the letters contained unknown addresses, so when I was catching up with new domains last week, I again expected to get ca 20 % from the 50.000 new mails. But not so. After sorting and removing duplicates, I only got ca 2.500 new addresses, nearly 5.000 fewer than in May from the same amount of data.
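
The comparison above amounts to a set difference followed by a ratio. A minimal Python sketch, with hypothetical function names, using only the approximate figures quoted in this mail:

```python
def new_addresses(known, batch):
    """Addresses in batch that are not already known, sorted, no duplicates."""
    return sorted(set(batch) - set(known))

def yield_rate(new_count, mail_count):
    """New addresses found per processed mail."""
    return new_count / mail_count

# Approximate figures from this mail: two batches of ca 50.000 mails each.
may_rate = yield_rate(8000, 50000)        # May batch
last_week_rate = yield_rate(2500, 50000)  # last week's batch
print(f"May: {may_rate:.0%}, last week: {last_week_rate:.0%}")
```

The point of the comparison is the drop between the two rates for the same amount of data, not the exact percentages, since all the inputs are rounded.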

Yeah, but in May all were new; now it is only 3-4 months later and much of that spam is still around!

True, where Spamarchive gets its info can taint the result, but after looking at it, my conclusion is this: if there really were a very large number of domains/numbers in use, I should have got at least ca half to two thirds of the May batch's number, because with a significantly large number of domains I would have missed a lot in May, which would raise last week's figures. But that didn't happen.

On one occasion here during the summer, I said that even if the spam counts run into millions, the number of operators/domains is far smaller. With some 120.000 mails I should have some statistical relevance, and I still want to press the point that we probably only have some 30.000-50.000 active spam domains/call center numbers, which makes the blocking task easier if we use the right data.

Another reflection: the ISP for the account where I have the most problems installed SpamAssassin in some Bayesian mode during the Sobig attack last week. Due to the number of users and the volume, it worked hard, getting visibly more effective during the weekend. But simply grepping incoming mails for the URLs on my block list is still ca 30 % more effective than SpamAssassin, which also has more false positives. For those interested in the domain/numbers data, you will find it at:


http://hem.passagen.se/kmn_asrg/spamlist.txt

Note that Sun's implementation(s) of grep have some funny behaviours, so I had to protect the dot in the URLs. I also found out that grep is guilty of some of my false positives (1 in 200), since it matches domains in the list against words in the mail that are not domains. None of the (e/f)grep alternatives helps, and Sun is my only environment, so if someone has a clue, I would appreciate a hint. Also, some addresses I never decoded, since my proof-of-concept filter doesn't have any decoding filters; only my URL/phone extraction tool has that today.
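
One possible way around both grep problems mentioned above (the unprotected dot, and list entries matching inside ordinary words) is sketched below in Python. This is an assumption-laden illustration, not the filter used here: the function names and the boundary rules are made up for the example, and it handles domain entries only, not phone numbers.

```python
import re

def build_matcher(domains):
    """Compile one regex that matches any listed domain as a whole host name."""
    if not domains:
        raise ValueError("block list is empty")
    # re.escape protects the dots, so "pills.com" cannot match "pillsXcom".
    alternatives = "|".join(
        re.escape(d) for d in sorted(domains, key=len, reverse=True)
    )
    # Lookbehind: no letter, digit or hyphen directly before the entry
    # (a dot is allowed, so www.pills.com still matches pills.com).
    # Lookahead: the host name must not continue after the entry.
    return re.compile(
        r"(?<![A-Za-z0-9-])(?:" + alternatives + r")(?![A-Za-z0-9-]|\.[A-Za-z0-9])",
        re.IGNORECASE,
    )

def is_blocked(mail_text, matcher):
    """True if the mail mentions any block-listed domain as a host name."""
    return matcher.search(mail_text) is not None

matcher = build_matcher({"pills.com"})
print(is_blocked("Buy at http://www.pills.com/offer", matcher))
print(is_blocked("I took my pills.community notes", matcher))
```

The boundary checks are deliberately strict: an entry only counts when it ends a host name, so a listed domain that merely appears as part of a longer word or a longer domain is ignored.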


Kurt Magnusson




_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg
