ietf-asrg

Re: [Asrg] "Uncaught spam" research project

2010-04-30 13:55:30
Martijn Grooten wrote, On 4/30/10 10:37 AM:
[...]

[1] Spam in the context of this email is spam sent to spam traps. So the
real, proper spam, not the perhaps-not-100%-CAN-SPAM-compliant spam.

That skews your sample quite a bit. A significant fraction of the hard cases these days are CAN-SPAM compliant campaigns sent by well-meaning originators using mostly-whitehat ESPs. The vast majority of spam is nothing like that, but the vast majority is also fairly easy to shun for all but the largest and cheapest providers. The spam that ends up in a real user's normal Inbox (where it is most annoying to them) is unlikely to ever hit a trap address. On the other side, traps vary in provenance, so there is a lot of variation in what they get. But as a direct corollary of the fact that you can call an address a trap, the mail sent to it will be coming from less careful and less clueful spammers. That tilt won't be the same for every sort of trap, but it will be there for all traps to some degree.

That does not make your research pointless, but it is important to understand the skew and the subjective "cost" of different sorts of uncaught spam in applying whatever you find.

[2] Several of these make use of open source filters (e.g. SpamAssassin),
so it's fair to say that most filters are covered.

Not so much. One real-world distinction between vendorware wrapping SA and SA deployed consciously is that the former is often bought as an alternative to employing skilled staff, while the latter is likely to be a tool that is constantly being adjusted and enhanced by skilled staff. Sites vary in what spam they get, what non-spam they get, and their FP tolerance. Spammers adapt somewhat over time to filtering tactics, especially to SA because it is the dominant open source filtering tool. A commercial filter built around SA is likely to use an older version, with its ornate configurability set so that the defaults are safe for any site and local adjustment is exposed in only the simplest ways. A well-managed SA deployment, by contrast, is likely to be kept current and to have departures from the distributed config defaults that would be intolerable for other sites.

Again, that is an issue that speaks primarily to interpretation and application of your results, rather than to the whole plan. If you don't have a skilled SA wrangler handy you cannot test the results of a customized and tuned SA deployment, and that is one of the strong arguments against using straight SA in the real world.

[3] I would love to include DKIM, but I can only distinguish between does
have and does not have a DKIM-signature; the redacting of emails to hide
the original recipient makes me unable to decide whether a present
signature was actually valid.

Probably not a big loss in itself, as DKIM correlation with spamminess is very sensitive to the sort of mailstream one has, and in complex ways.

HOWEVER, this raises a very serious design pitfall. If you plan to feed redacted messages to filters, you need to make sure the filters can be made to ignore whatever redaction-spoor is in those messages. The simplest example is the one you give: broken signatures. A valid signature may correlate very poorly with the legitimacy of the mail, while a bad signature may correlate quite well with illegitimacy. As a general rule, redaction for the purpose of hiding individual identity naturally tends to make all of the redacted messages a little bit more like each other, in ways that may be obvious or may be subtle, and many filters are designed to look for patterns of similarity as evidence of spam.
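
To make the broken-signature case concrete, here is a minimal sketch (not a DKIM verifier) that reports which redacted header fields are covered by the h= tag of a message's DKIM-Signature, since altering any signed header invalidates the signature. The function names and the sample message are hypothetical, made up for illustration. The sketch only reads the h= tag; it does not check the body hash, so it says nothing about redaction inside the body.

```python
import re
from email import message_from_string


def signed_header_names(msg):
    """Collect the lowercase header names listed in the h= tag of every DKIM-Signature."""
    names = set()
    for sig in msg.get_all("DKIM-Signature", []):
        # Remove folding whitespace; tags in the signature are separated by ';'.
        compact = re.sub(r"\s+", "", str(sig))
        for tag in compact.split(";"):
            key, sep, value = tag.partition("=")
            if sep and key.lower() == "h":
                names.update(name.lower() for name in value.split(":") if name)
    return names


def redaction_breaks_signature(raw_message, redacted_headers):
    """Return the redacted header names (lowercase, sorted) that a signature covers."""
    msg = message_from_string(raw_message)
    signed = signed_header_names(msg)
    return sorted(signed & {h.lower() for h in redacted_headers})


# Hypothetical sample message; the signature values are placeholders, not real data.
SAMPLE = (
    "DKIM-Signature: v=1; a=rsa-sha256; d=example.com; s=sel;\n"
    " h=From:To:Subject:Date; bh=AAAA=; b=BBBB\n"
    "From: sender@example.com\n"
    "To: trap@example.net\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)

print(redaction_breaks_signature(SAMPLE, ["To", "Delivered-To"]))
```

A non-empty result means the redaction alone is enough to turn a possibly valid signature into a failing one, which is exactly the kind of artifact a filter could pick up on.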

Ultimately I think it is so hard to be sure that you are avoiding significant effects from redaction that researching filters with redacted inputs is a total waste of time. I don't really understand the point of redaction for this anyway, since the addresses are traps. The case for recipient redaction may be plausible when spam is being reported in public or to untrusted parties, i.e. to protect individual privacy or to avoid the effects of disclosing a trap address. For filter testing with trap spam, the only risk of disclosure would be if a filter works in some collaborative fashion akin to DCC or Razor/Pyzor. If you are afraid of that resulting in harmful disclosure of your trap addresses, then you have an intractable problem. The GIGO principle applies, and any modification of your inputs from what a filter would see in the real world makes your inputs garbage.
