Re: [Asrg] "Uncaught spam" research project

Bill Cole wrote:

That skews your sample quite a bit. A significant fraction of the hard cases
these days are CAN-SPAM compliant campaigns sent by well-meaning originators
using mostly-whitehat ESPs. The vast majority of spam is nothing like that,
but the vast majority is also fairly easy to shun for all but the largest
and cheapest providers. The spam that ends up in a real user's normal Inbox
(where it is most annoying to them) is unlikely to ever hit a trap address.
On the other side, traps vary in provenance so there is a lot of variation
in what they get but as a direct corollary of the fact that you can call an
address a trap, the mail it is sent will be coming from less careful and
clueful spammers. That tilt won't be the same for every sort of trap, but it
will be there for all traps to some degree.

That does not make your research pointless, but it is important to
understand the skew and the subjective "cost" of different sorts of uncaught
spam in applying whatever you find.


Oh, I absolutely agree. There are many reasons why I want to concentrate on 
spam-trap spam (perhaps the most important one being that for these messages 
it's so much easier to decide whether they are actually both spam and unwanted 
-- they are both more or less by definition), but I am well aware that it only 
covers part of the spam. And a part that's already easy to filter (but, 
arguably, also a part where not filtering, especially in the case of phishing, 
can be more dangerous). I might run the same project on a different spam corpus 
at some point in the future.

Again, that is an issue that speaks primarily to interpretation and
application of your results, rather than to the whole plan. If you don't
have a skilled SA wrangler handy you cannot test the results of a customized
and tuned SA deployment, and that is one of the strong arguments against
using straight SA in the real world.


Good point. For this very reason I'm not testing SA as part of the comparative 
test; that just wouldn't be fair on it.

[3] I would love to include DKIM, but I can only distinguish between does
have and does not have a DKIM-signature; the redacting of emails to hide
the original recipient makes me unable to decide whether a present
signature was actually valid.


Probably not a big loss in itself, as DKIM correlation with spamminess is
very sensitive to the sort of mailstream one has, and in complex ways.


Actually, (almost) no spam trap spam I receive has a DKIM signature, valid or 
not.
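
For illustration, the has/does-not-have distinction from [3] could be sketched
in Python with the standard library's email parser. This is only a sketch, not
the code used in the test; the function name has_dkim_signature and the sample
messages are made up. It checks for the presence of the header and says nothing
about whether a signature is valid.

```python
from email import message_from_string


def has_dkim_signature(raw_message: str) -> bool:
    """Return True if the message carries at least one DKIM-Signature header.

    Presence only: no attempt is made to verify the signature.
    """
    msg = message_from_string(raw_message)
    # Header lookup is case-insensitive; get_all returns None when absent.
    return msg.get_all("DKIM-Signature") is not None


# Hypothetical sample messages, for illustration only.
signed = (
    "DKIM-Signature: v=1; a=rsa-sha256; d=example.com; s=sel; b=abc\n"
    "From: sender@example.com\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
unsigned = "From: sender@example.com\nSubject: hello\n\nbody\n"

print(has_dkim_signature(signed), has_dkim_signature(unsigned))
```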

HOWEVER, this raises a very serious design pitfall. You need to make sure if
you plan to feed redacted messages to filters that they can be made to
ignore whatever redaction-spoor is in those messages. The simplest example
is the one you give: broken signatures. A valid signature may correlate very
poorly to validity of the mail while a bad signature may correlate quite
well to invalidity. As a general rule, redaction for the purpose of hiding
individual identity naturally tends to make all of the redacted messages a
little bit more like each other in ways that may be obvious or may be
subtle, and many filters are designed to look for patterns of similarity as
evidence of spam.

Ultimately I think it is so hard to be sure that you are avoiding
significant effects from redaction that researching filters with redacted
inputs is a total waste of time. I don't really understand the point of
redaction for this anyway, since the addresses are traps. The case for
recipient redaction may be plausible when spam is being reported in public
or to untrusted parties, i.e. to protect individual privacy or the effects
of disclosing a trap address. For filter testing with trap spam, the only
risk of disclosure would be if a filter works in some collaborative fashion
akin to DCC or Razor/Pyzor. If you are afraid of that resulting in harmful
disclosure of your trap addresses, then you have an intractable problem. The
GIGO principle applies, and any modification of your inputs from what a
filter would see in the real world makes your inputs garbage.


I agree with you in theory. And I know one should be wary assuming things about 
spammers' behaviour.

However, all that happens when redacting the messages is something along the 
lines of

s/spamtrapdomain/mydomain/g

and

s/spamtrap-local-part/my-local-part/g

Apart from signatures -- which, in practice, turn out to be hardly there -- I 
don't see how this affects the setup.
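
To make that concrete, the two substitutions above amount to roughly the
following Python sketch. The names (redact, TRAP_DOMAIN and so on) and the
example addresses are placeholders, not the actual trap or replacement values,
and the real redaction may differ in detail (for instance in case handling).

```python
# Placeholder values, for illustration only; not real trap addresses.
TRAP_DOMAIN = "spamtrapdomain.example"
TRAP_LOCAL_PART = "spamtrap-local-part"
MY_DOMAIN = "mydomain.example"
MY_LOCAL_PART = "my-local-part"


def redact(raw_message: str) -> str:
    """Apply the two global substitutions to the whole message text."""
    # str.replace is literal and case-sensitive, and replaces every
    # occurrence, in headers and body alike.
    out = raw_message.replace(TRAP_DOMAIN, MY_DOMAIN)
    out = out.replace(TRAP_LOCAL_PART, MY_LOCAL_PART)
    return out


original = (
    "To: spamtrap-local-part@spamtrapdomain.example\n"
    "Subject: offer for spamtrap-local-part\n"
    "\n"
    "Dear spamtrap-local-part@spamtrapdomain.example, ...\n"
)
redacted = redact(original)
print(redacted)
```

Everything other than the address strings is left byte-for-byte as received.
Because the substitution also touches any signed header or body text that
contains the address, a DKIM signature on such a message can no longer be
checked, which is the one effect already mentioned.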

Martijn.

Virus Bulletin Ltd, The Pentagon, Abingdon, OX14 3YP, England.
Company Reg No: 2388295. VAT Reg No: GB 532 5598 33.
_______________________________________________
Asrg mailing list
Asrg(_at_)irtf(_dot_)org
http://www.irtf.org/mailman/listinfo/asrg