Martijn Grooten wrote, On 4/30/10 10:37 AM:
[...]
> [1] Spam in the context of this email is spam sent to spam traps. So the
> real, proper spam, not the perhaps-not-100%-CAN-SPAM-compliant spam.
That skews your sample quite a bit. A significant fraction of the hard cases
these days are CAN-SPAM-compliant campaigns sent by well-meaning originators
using mostly-whitehat ESPs. The vast majority of spam is nothing like that,
but the vast majority is also fairly easy to shun for all but the largest
and cheapest providers. The spam that ends up in a real user's normal inbox
(where it is most annoying to them) is unlikely to ever hit a trap address.
On the other side, traps vary in provenance, so there is a lot of variation
in what they receive; but as a direct corollary of the fact that the address
qualifies as a trap, the mail sent to it will come from less careful and
less clueful spammers. That tilt won't be the same for every sort of trap,
but it will be there for all traps to some degree.
That does not make your research pointless, but it is important to
understand the skew and the subjective "cost" of different sorts of uncaught
spam when applying whatever you find.
> [2] Several of these make use of open source filters (e.g. SpamAssassin),
> so it's fair to say that most filters are covered.
Not so much. One real-world distinction between vendorware wrapping SA and
SA deployed consciously is that the former is often bought as an alternative
to employing skilled staff, while the latter is likely to be a tool that is
constantly being adjusted and enhanced by skilled staff. Sites vary in what
spam they get, what non-spam they get, and in their tolerance for false
positives. Spammers adapt somewhat over time to filtering tactics,
especially to SA, because it is the dominant open source filtering tool. A
commercial filter built around SA is likely to use an older version, with
its ornate configurability locked down so that the defaults are safe for any
site and exposed to local adjustment in only the simplest ways; a
well-managed SA deployment, by contrast, is likely to be kept current and to
carry departures from the distributed configuration defaults that would be
intolerable at other sites.
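To make that contrast concrete, here is a sketch of the sort of local.cf
departures a tuned deployment might carry. The specific rule names are real
SA rules, but every value here is illustrative and site-specific, not a
recommendation; any of them would be reckless as a shipped default:

```
# local.cf fragment -- illustrative only; safe values depend on the local mailstream
required_score        4.2           # tighter than the stock 5.0 threshold
score BAYES_99        4.0           # local Bayes corpus is well trained
score RAZOR2_CHECK    2.5           # trust collaborative checksum hits more
bayes_auto_learn      1             # keep the Bayes DB current automatically
trusted_networks      192.0.2.0/24  # this site's own relays
```

A vendor wrapping SA can ship almost none of this, precisely because the
safe value of every line depends on the mail a particular site receives.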
Again, that is an issue that speaks primarily to the interpretation and
application of your results rather than to the whole plan. If you don't
have a skilled SA wrangler handy, you cannot test the results of a
customized and tuned SA deployment, and that is one of the strong arguments
against using straight SA in the real world.
> [3] I would love to include DKIM, but I can only distinguish between does
> have and does not have a DKIM-signature; the redacting of emails to hide
> the original recipient makes me unable to decide whether a present
> signature was actually valid.
Probably not a big loss in itself, as DKIM correlation with spamminess is
very sensitive to the sort of mailstream one has, and in complex ways.
HOWEVER, this raises a very serious design pitfall. If you plan to feed
redacted messages to filters, you need to make sure that the filters can be
made to ignore whatever redaction spoor those messages carry. The simplest
example is the one you give: broken signatures. A valid signature may
correlate very poorly with the validity of the mail, while a bad signature
may correlate quite well with invalidity. As a general rule, redaction for
the purpose of hiding individual identity naturally tends to make all of
the redacted messages a little more like each other, in ways that may be
obvious or subtle, and many filters are designed to look for patterns of
similarity as evidence of spam.
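Both effects are easy to demonstrate. The sketch below is a deliberately
simplified model: the `redact` helper and the `x@removed.invalid`
placeholder are hypothetical, and real DKIM body hashing applies a
canonicalization step that is skipped here. It shows that any change to the
body invalidates a signature's body hash, and that a fixed placeholder
injects an identical token into otherwise unrelated messages:

```python
import base64
import hashlib

def body_hash(body: bytes) -> str:
    # DKIM's bh= tag carries the base64 SHA-256 of the canonicalized body;
    # canonicalization is skipped here for brevity (an assumption).
    return base64.b64encode(hashlib.sha256(body).digest()).decode()

def redact(msg: bytes, addr: bytes) -> bytes:
    # Hypothetical redactor: replace the trap address with a fixed placeholder.
    return msg.replace(addr, b"x@removed.invalid")

msg_a = b"To: trap-one@example.org\r\n\r\nBuy pills now!\r\n"
msg_b = b"To: trap-two@example.net\r\n\r\nTotally different pitch.\r\n"

red_a = redact(msg_a, b"trap-one@example.org")
red_b = redact(msg_b, b"trap-two@example.net")

# 1. Any signature computed over the original body no longer verifies:
print(body_hash(msg_a) != body_hash(red_a))  # True

# 2. Redaction injects an identical token into every message -- exactly the
#    kind of cross-message similarity that some filters treat as spam sign:
shared = set(red_a.split()) & set(red_b.split())
print(b"x@removed.invalid" in shared)  # True
```

Smarter redaction (per-message random placeholders, re-signing) can blunt
each individual symptom, but each fix is another way the test corpus drifts
from what a filter would see in production.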
Ultimately I think it is so hard to be sure that you are avoiding
significant effects from redaction that researching filters with redacted
inputs is a total waste of time. I don't really understand the point of
redaction for this anyway, since the addresses are traps. The case for
recipient redaction may be plausible when spam is being reported in public
or to untrusted parties, i.e. to protect individual privacy or to avoid the
consequences of disclosing a trap address. For filter testing with trap
spam, the only
risk of disclosure would be if a filter works in some collaborative fashion
akin to DCC or Razor/Pyzor. If you are afraid of that resulting in harmful
disclosure of your trap addresses, then you have an intractable problem. The
GIGO principle applies, and any modification of your inputs from what a
filter would see in the real world makes your inputs garbage.
_______________________________________________
Asrg mailing list
Asrg@irtf.org
http://www.irtf.org/mailman/listinfo/asrg