
[Asrg] "uncaught spam" follow-up

2011-02-22 13:52:55
Almost a year ago, I posted to the group about a little research project I 
intended to set up, where I would compare spam that is easy to filter, i.e. 
(almost) all spam-filters I use would block it, to spam that is missed by some 
products. [1]

I've finally found the time to have a little go at this. During a few weeks in 
November, I looked at the geographical distribution of sending IP addresses in 
my total spam corpus and compared that with the sub-corpus of 'difficult' spam. 
The latter is defined as spam that is missed by at least two[2] filters. I used 
a GeoIP lookup on the sending IP address to determine the country of origin of 
each spam message.
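
The split and tally described above can be sketched roughly as follows. This 
is only an illustration, not the code used for the test: the record layout 
(a "country" field from the GeoIP lookup and a "missed_by" count of filters 
that failed to block the message) and the function names are assumptions, and 
the sample records are made up.

```python
from collections import Counter

MIN_MISSES = 2  # "difficult" spam: missed by at least two filters


def split_difficult(messages, min_misses=MIN_MISSES):
    """Return the sub-corpus of messages missed by at least min_misses filters."""
    return [m for m in messages if m["missed_by"] >= min_misses]


def country_shares(messages):
    """Return {country: percentage of messages} for the given corpus."""
    counts = Counter(m["country"] for m in messages)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: 100.0 * n / total for country, n in counts.items()}


# Made-up toy records (hypothetical layout), one per spam message.
corpus = [
    {"country": "RU", "missed_by": 0},
    {"country": "RU", "missed_by": 3},
    {"country": "IN", "missed_by": 2},
    {"country": "US", "missed_by": 1},
]

difficult = split_difficult(corpus)
all_shares = country_shares(corpus)
difficult_shares = country_shares(difficult)
```

Sorting either dictionary by value in descending order gives a ranking of the 
kind shown in the tables.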

It resulted in the following distributions:

All spam
1       Russian Federation      9.86%
2       India                   7.67%
3       United States           7.35%
4       Brazil                  5.86%
5       Vietnam                 5.70%
6       Ukraine                 4.84%
7       United Kingdom          3.94%
8       South Korea             2.92%
9       Italy                   2.78%
10      Indonesia               2.54%

Difficult spam
1       Russian Federation      11.09%
2       India                   8.45%
3       Vietnam                 5.22%
4       South Korea             4.94%
5       Ukraine                 3.83%
        Brazil                  3.83%
7       Indonesia               3.55%
8       China                   3.52%
9       United States           3.41%
10      United Kingdom          3.26%

Then, during a few weeks in December, I looked at the distribution of 
MIME-types among two similarly defined corpora.

All spam
Text and HTML           34.4%
Text                    31.9%
HTML                    30.8%
Image[3]                1.6%
Other[4]                1.3%

Difficult spam
Text                    52.3%
Text and HTML           26.9%
HTML                    14.4%
Other                   5.1%
Image                   1.3%
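
One possible way to read the two MIME-type tables side by side is to divide 
each type's share of the difficult spam by its share of all spam; a value 
above 1 means the type is over-represented among the difficult messages. The 
sketch below uses only the percentages from the tables. The ratio itself and 
the function name are illustrative assumptions, not part of the original 
test, and the ratio says nothing about why a type is over-represented.

```python
# Percentages copied from the two MIME-type tables.
all_spam = {
    "Text and HTML": 34.4,
    "Text": 31.9,
    "HTML": 30.8,
    "Image": 1.6,
    "Other": 1.3,
}
difficult_spam = {
    "Text": 52.3,
    "Text and HTML": 26.9,
    "HTML": 14.4,
    "Other": 5.1,
    "Image": 1.3,
}


def representation_ratio(subset, overall):
    """Share in the subset divided by share overall, per category.

    Categories missing from either table, or with a zero overall share,
    are skipped rather than guessed.
    """
    return {
        key: subset[key] / overall[key]
        for key in subset
        if key in overall and overall[key] > 0
    }


ratios = representation_ratio(difficult_spam, all_spam)

# Print categories from most to least over-represented.
for mime_type, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{mime_type:<14} {ratio:.2f}")
```

The same function could be applied to the country tables, restricted to the 
countries that appear in both top-ten lists.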

Now I am the first to answer the question 'what does this mean?' with 'not 
much', but I wanted to share the results anyway.

My gut feeling is that plain text messages being 'harder to filter' has little 
to do with the fact that they are written in plain text; rather, I suspect it 
is a consequence of a particular spam campaign that for some reason gives 
filters more problems and just happens to send its messages in plain text.

The fact that spam from several Asian and other non-Western countries is 
'harder to filter' may have to do with the fact that most filters are developed 
in Western countries and that communication between local/regional ISPs on 
which IP addresses to block may be easier. However, again that is only a guess.

I intend to continue to run tests like this. Suggestions, both on what to test 
as well as on how to interpret the results, are more than welcome.

Martijn.

[1] http://www.ietf.org/mail-archive/web/asrg/current/msg16306.html

[2] to avoid the results being skewed by one of the (20) filters having some 
kind of issue; I am aware that spam-filters may use the same/similar 
technologies so if one such technology (e.g. a blacklist) had an 'issue' this 
would still have skewed the results.

[3] messages with one or more embedded images, regardless of other MIME types 
present.

[4] including DSNs, attached ZIPs/PDFs/Docs, apparently broken MIME.

Martijn Grooten, Anti-spam Test Director, www.virusbtn.com
Tel: +44 1235 540235 / +44 7872 674989
VB2011: 5-7 October 2011, Barcelona, Spain



Virus Bulletin Ltd, The Pentagon, Abingdon, OX14 3YP, England.
Company Reg No: 2388295. VAT Reg No: GB 532 5598 33.