This is a good point. Quantitative data, rather than anecdotal reports,
would be useful. We started spamarchive.org a few months ago to provide such
a standard and open spam corpus. (The latest archives are not online right
now because we are changing hosting facilities, but they are available to
anyone who would like to use them.) Another missing piece is a set of tools for
anonymizing, measuring, and analyzing spam data. I mention this and give
some examples in my talk at the spam conference
(http://www.spamconference.org/proceedings2003.html).
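
To make the anonymizing step concrete, here is a minimal Python sketch of the
kind of tool I have in mind. It is only an illustration, not an existing
spamarchive.org tool: the salt value, the choice of headers, the address
pattern, and the function names are all assumptions. It replaces recipient
addresses in selected headers with a salted hash, so the same recipient always
maps to the same pseudonym while the domain stays available for per-domain
statistics.

```python
import hashlib
import re
from email import message_from_string

# Hypothetical salt; a real tool would keep this secret for each corpus.
SALT = b"example-corpus-salt"

# Deliberately simple address pattern (an assumption, not a full RFC parser).
ADDR_RE = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+)")


def _pseudonym(match):
    """Replace the local part with a short salted hash and keep the domain."""
    address = match.group(0).lower().encode("utf-8")
    digest = hashlib.sha256(SALT + address).hexdigest()[:12]
    return f"user-{digest}@{match.group(1)}"


def anonymize(raw_message, headers=("To", "Cc", "Delivered-To")):
    """Return the message text with recipient addresses pseudonymized.

    Only the listed headers are rewritten. The sender, the Received lines,
    and the body are left untouched so they remain available for analysis.
    Rewritten headers are re-added at the end of the header block.
    """
    msg = message_from_string(raw_message)
    for name in headers:
        values = msg.get_all(name)
        if values is None:
            continue
        del msg[name]  # removes every occurrence of this header
        for value in values:
            msg[name] = ADDR_RE.sub(_pseudonym, value)
    return msg.as_string()


if __name__ == "__main__":
    sample = (
        "From: spammer@bulk.example\n"
        "To: Alice <alice@example.org>\n"
        "Subject: Buy now\n"
        "\n"
        "Hello\n"
    )
    print(anonymize(sample))
```

Recipient addresses that appear in the body or in Received lines are not
handled here; a real tool would need to scrub those as well.
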
Another relevant thread is one started by Kee Hinckley called "Re: [Asrg]
Back to the charter". It aims to categorize spam by the technique used to
send it.
-----Original Message-----
From: Fred Bacon [mailto:bacon(_at_)aerodyne(_dot_)com]
Sent: Sunday, March 09, 2003 6:59 PM
To: Paul Judge
Cc: 'Asrg (asrg(_at_)ietf(_dot_)org)'
Subject: Re: [Asrg] ASRG work items
Please forgive me if I restate what has already been
discussed. I have spent the afternoon going through the mail
archive, but I could not possibly read every message.
On Sun, 2003-03-09 at 14:38, Paul Judge wrote:
Milestones/Deliverables:
1. problem statement/ requirements document
Keith Moore and Balachander Krishnamurthy have started a good thread on
"requirements for a proposed solution + notion of consent" (also called
"evaluating proposals against requirements").
I would like to comment on the first milestone. It seems to
me that there is considerable disagreement on even so simple
a matter as the amount of spam with forged addresses. I
believe that one of the first items which should be addressed
is a quantitative assessment of the methods and varieties of
spam. Part of this would be a standard spam corpus against
which filters could be tested. But there should be other
quantitative activities as well. The spam messages in the
corpus are only a part of the data. Log file entries related
to those messages should also be recorded and maintained for
analysis. For instance, what percentage of spam messages
really do come from open relays in this day and age? Can
anyone say for certain? Where is the data to determine this?
I suggest that an early goal for ASRG should be to develop
and distribute a standard set of spam collection and analysis
tools. These tools should be suitable for instrumenting
servers (either production mail servers or honeypots) and
building an extensive database of spam for analysis. No
source of information should be ignored. Everything should
be recorded: the message, the server logs, and all related TCP packets.
Of course, great care would need to be taken to protect the
privacy of the spam recipients. Honeypots may be the only
viable method for this level of data collection. In fact, I
would recommend a network of honeypots in different TLDs and
geographic locations.
I hope this suggestion is useful.
Fred Bacon
Senior Scientist
Aerodyne Research, Inc.
Billerica, MA 01821
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg