Martijn Grooten wrote, On 4/30/10 10:37 AM:
[...]
> [1] Spam in the context of this email is spam sent to spam traps. So the
> real, proper spam, not the perhaps-not-100%-CAN-SPAM-compliant spam.
That skews your sample quite a bit. A significant fraction of the hard cases
these days are CAN-SPAM-compliant campaigns sent by well-meaning originators
using mostly-whitehat ESPs. The vast majority of spam is nothing like that,
but the vast majority is also fairly easy to shun for all but the largest
and cheapest providers. The spam that ends up in a real user's normal inbox
(where it is most annoying to them) is unlikely to ever hit a trap address.
On the other side, traps vary in provenance, so there is a lot of variation
in what they receive; but as a direct corollary of the fact that the address
qualifies as a trap, the mail sent to it will come from less careful and
less clueful spammers. That tilt won't be the same for every sort of trap,
but it will be there for all traps to some degree.
That does not make your research pointless, but it is important to
understand the skew and the subjective "cost" of different sorts of uncaught
spam when applying whatever you find.
> [2] Several of these make use of open source filters (e.g. SpamAssassin),
> so it's fair to say that most filters are covered.
Not so much. One real-world distinction between vendorware wrapping SA and
SA deployed consciously is that the former is often bought as an alternative
to employing skilled staff, while the latter is likely to be a tool that is
constantly being adjusted and enhanced by skilled staff. Sites vary in what
spam they get, what non-spam they get, and in their tolerance for false
positives. Spammers adapt somewhat over time to filtering tactics,
especially to SA, because it is the dominant open source filtering tool. A
commercial filter built around SA is likely to use an older version, with
its ornate configurability locked down so that the defaults are safe for any
site and exposed to local adjustment in only the simplest ways; a
well-managed SA deployment, by contrast, is likely to be kept current and to
carry departures from the distributed configuration defaults that would be
intolerable at other sites.
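To make that contrast concrete, here is a sketch of the sort of local.cf
departures a tuned deployment might carry. The specific rule names are real
SA rules, but every value here is illustrative and site-specific, not a
recommendation; any of them would be reckless as a shipped default:

```
# local.cf fragment -- illustrative only; safe values depend on the local mailstream
required_score        4.2           # tighter than the stock 5.0 threshold
score BAYES_99        4.0           # local Bayes corpus is well trained
score RAZOR2_CHECK    2.5           # trust collaborative checksum hits more
bayes_auto_learn      1             # keep the Bayes DB current automatically
trusted_networks      192.0.2.0/24  # this site's own relays
```

A vendor wrapping SA can ship almost none of this, precisely because the
safe value of every line depends on the mail a particular site receives.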
Again, that is an issue that speaks primarily to the interpretation and
application of your results rather than to the whole plan. If you don't
have a skilled SA wrangler handy, you cannot test the results of a
customized and tuned SA deployment, and that is one of the strong arguments
against using straight SA in the real world.
> [3] I would love to include DKIM, but I can only distinguish between does
> have and does not have a DKIM-signature; the redacting of emails to hide
> the original recipient makes me unable to decide whether a present
> signature was actually valid.
Probably not a big loss in itself, as DKIM correlation with spamminess is
very sensitive to the sort of mailstream one has, and in complex ways.
HOWEVER, this raises a very serious design pitfall. If you plan to feed
redacted messages to filters, you need to make sure that the filters can be
made to ignore whatever redaction spoor those messages carry. The simplest
example is the one you give: broken signatures. A valid signature may
correlate very poorly with the validity of the mail, while a bad signature
may correlate quite well with invalidity. As a general rule, redaction for
the purpose of hiding individual identity naturally tends to make all of
the redacted messages a little more like each other, in ways that may be
obvious or subtle, and many filters are designed to look for patterns of
similarity as evidence of spam.
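Both effects are easy to demonstrate. The sketch below is a deliberately
simplified model: the `redact` helper and the `x@removed.invalid`
placeholder are hypothetical, and real DKIM body hashing applies a
canonicalization step that is skipped here. It shows that any change to the
body invalidates a signature's body hash, and that a fixed placeholder
injects an identical token into otherwise unrelated messages:

```python
import base64
import hashlib

def body_hash(body: bytes) -> str:
    # DKIM's bh= tag carries the base64 SHA-256 of the canonicalized body;
    # canonicalization is skipped here for brevity (an assumption).
    return base64.b64encode(hashlib.sha256(body).digest()).decode()

def redact(msg: bytes, addr: bytes) -> bytes:
    # Hypothetical redactor: replace the trap address with a fixed placeholder.
    return msg.replace(addr, b"x@removed.invalid")

msg_a = b"To: trap-one@example.org\r\n\r\nBuy pills now!\r\n"
msg_b = b"To: trap-two@example.net\r\n\r\nTotally different pitch.\r\n"

red_a = redact(msg_a, b"trap-one@example.org")
red_b = redact(msg_b, b"trap-two@example.net")

# 1. Any signature computed over the original body no longer verifies:
print(body_hash(msg_a) != body_hash(red_a))  # True

# 2. Redaction injects an identical token into every message -- exactly the
#    kind of cross-message similarity that some filters treat as spam sign:
shared = set(red_a.split()) & set(red_b.split())
print(b"x@removed.invalid" in shared)  # True
```

Smarter redaction (per-message random placeholders, re-signing) can blunt
each individual symptom, but each fix is another way the test corpus drifts
from what a filter would see in production.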
Ultimately I think it is so hard to be sure that you are avoiding
significant effects from redaction that researching filters with redacted
inputs is a total waste of time. I don't really understand the point of
redaction for this anyway, since the addresses are traps. The case for
recipient redaction may be plausible when spam is being reported in public
or to untrusted parties, i.e. to protect individual privacy or to avoid the
consequences of disclosing a trap address. For filter testing with trap
spam, the only
risk of disclosure would be if a filter works in some collaborative fashion
akin to DCC or Razor/Pyzor. If you are afraid of that resulting in harmful
disclosure of your trap addresses, then you have an intractable problem. The
GIGO principle applies, and any modification of your inputs from what a
filter would see in the real world makes your inputs garbage.
_______________________________________________
Asrg mailing list
Asrg@irtf.org
http://www.irtf.org/mailman/listinfo/asrg