Almost a year ago, I posted to the group about a little research project I
intended to set up, where I would compare spam that is easy to filter, i.e.
(almost) all spam-filters I use would block it, to spam that is missed by some
filters.
I've finally found the time to have a little go at this. During a few weeks in
November, I looked at the geographical distribution of sending IP addresses in
my total spam corpus and compared that with the sub-corpus of 'difficult' spam.
The latter is defined as spam that is missed by at least two filters. I used a
geoip-lookup to determine the country of origin of each spam message.
It resulted in the following distributions:
Total spam corpus:
 1  Russian Federation   9.86%
 2  India                7.67%
 3  United States        7.35%
 4  Brazil               5.86%
 5  Vietnam              5.70%
 6  Ukraine              4.84%
 7  United Kingdom       3.94%
 8  South Korea          2.92%
 9  Italy                2.78%
10  Indonesia            2.54%
'Difficult' spam:
 1  Russian Federation  11.09%
 2  India                8.45%
 3  Vietnam              5.22%
 4  South Korea          4.94%
 5  Ukraine              3.83%
 7  Indonesia            3.55%
 8  China                3.52%
 9  United States        3.41%
10  United Kingdom       3.26%
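For anyone who wants to repeat this kind of comparison, here is a minimal sketch in Python of how the two country distributions could be computed. It is not the code used for the figures above. It assumes each message has already been reduced to a record holding the country returned by the geoip-lookup and the number of filters that missed it; the field names `country` and `missed_by`, the function `country_shares` and the toy records are all hypothetical.

```python
from collections import Counter


def country_shares(records, min_missed=0):
    """Return {country: percentage} over records missed by >= min_missed filters.

    `records` is an iterable of dicts with the hypothetical keys
    'country' (result of a geoip-lookup) and 'missed_by' (filter count).
    """
    selected = [r for r in records if r["missed_by"] >= min_missed]
    counts = Counter(r["country"] for r in selected)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders countries from largest to smallest share
    return {c: 100.0 * n / total for c, n in counts.most_common()}


# Toy data, invented purely for illustration
records = [
    {"country": "RU", "missed_by": 0},
    {"country": "RU", "missed_by": 3},
    {"country": "US", "missed_by": 0},
    {"country": "US", "missed_by": 1},
    {"country": "IN", "missed_by": 2},
    {"country": "US", "missed_by": 0},
]

all_spam = country_shares(records)                 # total corpus
difficult = country_shares(records, min_missed=2)  # missed by at least two filters

for country, share in all_spam.items():
    print(f"{country} {share:.2f}%  (difficult: {difficult.get(country, 0.0):.2f}%)")
```

The `min_missed=2` threshold mirrors the 'missed by at least two filters' definition; raising it would trade sample size for a stricter notion of 'difficult'.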
Then, during a few weeks in December, I looked at the distribution of
MIME-types among two similarly defined corpora.
Total spam corpus: Text and HTML 34.4%
'Difficult' spam:  Text and HTML 26.9%
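As an illustration of how messages might be sorted into MIME categories, the sketch below uses Python's standard `email` package. It is only one possible approach, not the method behind the figures above. The category labels other than 'Text and HTML', the function name `mime_category` and the order of the checks are assumptions: messages with any embedded image are counted as 'image' regardless of other parts, and everything that does not fit a simple text/HTML pattern falls into 'other'.

```python
from email import message_from_string


def mime_category(raw_message):
    """Classify a raw message by the content types of its leaf MIME parts."""
    msg = message_from_string(raw_message)
    # Collect the content types of all non-container parts
    types = {
        part.get_content_type()
        for part in msg.walk()
        if not part.is_multipart()
    }
    if any(t.startswith("image/") for t in types):
        return "image"          # any image, regardless of other parts
    if types == {"text/plain", "text/html"}:
        return "text and html"
    if types == {"text/plain"}:
        return "text only"
    if types == {"text/html"}:
        return "html only"
    return "other"              # e.g. DSNs, attachments, broken MIME


# Two hand-written sample messages, for illustration only
plain_sample = "Subject: hello\n\nJust plain text.\n"

alternative_sample = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--b\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>hello</p>\n"
    "--b--\n"
)

print(mime_category(plain_sample))
print(mime_category(alternative_sample))
```

Counting the returned labels over each corpus would then give a per-category percentage for the total and the 'difficult' set.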
Now I am the first to answer the question 'what does this mean?' with 'not
much', but I wanted to share the results anyway.
My gut feeling is that the fact that plain text messages are 'harder to filter'
has little to do with the fact that they are written in plain text; rather,
that it is a consequence of a certain spam campaign that for some reason gives
filters more problems and just happens to send its messages in plain text.
The fact that spam from several Asian and other non-Western countries is
'harder to filter' may have to do with the fact that most filters are developed
in Western countries and that communication between local/regional ISPs on
which IP addresses to block may be easier. However, again that is only a guess.
I intend to continue to run tests like this. Suggestions, both on what to test
and on how to interpret the results, are more than welcome.
Note on the 'at least two filters' threshold: this is to avoid the results being
skewed by one of the (20) filters having some kind of issue. I am aware that
spam-filters may use the same/similar technologies, so if one such technology
(e.g. a blacklist) had an 'issue', this would still have skewed the results.
 messages with one or more embedded images, regardless of other MIME types
 including DSNs, attached ZIPs/PDFs/Docs, apparently broken MIME.
Martijn Grooten, Anti-spam Test Director, www.virusbtn.com
Tel: +44 1235 540235 / +44 7872 674989
VB2011: 5-7 October 2011, Barcelona, Spain
Virus Bulletin Ltd, The Pentagon, Abingdon, OX14 3YP, England.
Company Reg No: 2388295. VAT Reg No: GB 532 5598 33.
Asrg mailing list