At 11:56 AM -0400 2003/08/13, Paul Judge wrote:
> 2. As you mentioned, with blacklists you need the list of IP addresses. The
> problem is that the list of IP addresses in the headers will often include
> IPs of internal mail servers that organizations do not wish to reveal. So,
> you often have to reduce this to the set of IP addresses that come before
> the recipient's organization in order to make this data public.
For larger organizations, you may pass through multiple different
network blocks. I submit that it won't be programmatically possible
to detect and eliminate all of them. IMO, the best you can hope to
do is to avoid the last hop in the "Received:" headers, and anything
else on that same network.
And that's assuming that there isn't internally generated spam
being sent by one customer of the ISP to another of the same ISP.
Then there are RFC 1918 network blocks to be considered (or eliminated).
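
A minimal sketch of the filtering described above: drop the last hop,
anything else on its network, and any RFC 1918 address. The function
name publishable_ips, the bracketed-IPv4 regex, and the /24 prefix used
as a stand-in for "that same network" are all assumptions made for
illustration. It also assumes the first address in the topmost
"Received:" header is the last hop. Real "Received:" headers vary far
more than this pattern handles.

```python
import ipaddress
import re

# Matches a bracketed IPv4 literal such as "[192.0.2.10]" in a Received: header.
BRACKETED_IPV4 = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")

# The three private address blocks defined by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def publishable_ips(received_headers, prefix_len=24):
    """Return relay IPs that are neither RFC 1918 nor on the last hop's network.

    received_headers is in message order: the newest header (last hop) first.
    prefix_len is an assumed approximation of "the same network".
    """
    hops = []
    for header in received_headers:
        for text in BRACKETED_IPV4.findall(header):
            try:
                hops.append(ipaddress.ip_address(text))
            except ValueError:
                continue  # e.g. "[999.1.1.1]" looks like an IP but is not one

    if not hops:
        return []

    # Treat the first address found as the last hop and derive its network.
    last_hop_net = ipaddress.ip_network(f"{hops[0]}/{prefix_len}", strict=False)

    kept = []
    for ip in hops[1:]:
        if ip in last_hop_net:
            continue  # same network as the last hop
        if any(ip in block for block in RFC1918_BLOCKS):
            continue  # private address, reveals only internal topology
        kept.append(str(ip))
    return kept
```

As noted above, a fixed prefix will miss organizations that span
several network blocks, so this only approximates the "best you can
hope to do" case.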
I think it might be easier to solve this problem by comparing the
"candidate spam" IP addresses against the "candidate ham" IP
addresses, and seeing if there are any duplicates. If there are, then
they get removed from the "candidate spam" list (to try to avoid
additional false positives).
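
The comparison above amounts to a set difference. A minimal sketch,
assuming both lists hold IP addresses as strings; the function name
prune_candidate_spam is hypothetical, and parsing each address with the
standard ipaddress module is an added normalization step, not something
the proposal specifies.

```python
import ipaddress


def prune_candidate_spam(candidate_spam, candidate_ham):
    """Drop every address from candidate_spam that also appears in candidate_ham.

    Addresses are parsed so that comparison is done on address values
    rather than on raw strings; the order of candidate_spam is preserved.
    """
    ham = {ipaddress.ip_address(ip.strip()) for ip in candidate_ham}
    return [
        ip.strip()
        for ip in candidate_spam
        if ipaddress.ip_address(ip.strip()) not in ham
    ]
```

For example, pruning ["192.0.2.1", "203.0.113.9"] against a ham list
containing "203.0.113.9" leaves only "192.0.2.1".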
> There are many intricacies here. The SpamAssassin guys have experienced them
> and within Spam Archive we've experienced them. It's just not as simple as
> you initially thought. It's far from impossible, but just requires some
> thoughtfulness. That is why I was outlining these three paths as potential
> paths for individuals to spend some time pursuing.
Indeed. Lots of nuances. And we've only started to begin to
consider scratching the surface.
Speaking of information sources, it strikes me that we might be
able to get the complete archives of relatively large numbers of
mailing lists, most of which should either have a high percentage of
"ham", or be something that can be processed according to modern
anti-spam methods and sorted into "candidate spam" vs. "candidate
ham".
For example, I know the listmaster at Apple, and he might be able
to help us. Through the Mailman mailing list (which Chuq also helps
to run), we might be able to get archives of other large sources,
especially including any of the lists hosted at python.org. I might
also be able to dig up some contacts at AOL for their ListServ box.
Would any of these information sources be of potential interest?
I mean, we're talking mailing lists with cumulative hundreds of
thousands and maybe even millions of users, which should result in
extremely large quantities of messages that could potentially be
examined. Indeed, most of them are probably already publicly
available via archives; it would just be a matter of getting more
convenient access to them.
--
Brad Knowles, <brad(_dot_)knowles(_at_)skynet(_dot_)be>
"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety."
-Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+
!w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++)
tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg