RE: [Asrg] A New Plan for No Spam / Velocity Indicator

From: Jim Youll <jim(_at_)media(_dot_)mit(_dot_)edu>

...

 > >   - A definition of the spam problem based on 89 messages received by

I am bothered by the facile talk from many quarters of a spam corpus
of a few thousand messages collected over months or years. ...

So, 89 isn't a very big number but...

when you back out the *redundancy* (most of those 1,000,000s a day
are like most of the others) what's really there? Granted there may be
some things that cannot be measured without knowing the true volume,
but surely useful information can be derived.


Anecdotes contain useful information.  They can tell you that something
is happening.  What they cannot tell you is most of what
http://www.verisign.com/resources/wp/spam/no_spam.pdf seems to conclude
on pages 2 and 3 (or 4 and 5 as counted by Ghostscript).  89 examples
can tell you that some spam comes from Korea and so forth.  Without
a lot of supporting evidence about the sampling, even 890,000 collected
over 60 hours cannot say whether 0.35%, 3.5%, or 35% of spam in general
is "mostly Korean, Chinese, and Japanese."
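
To make the sampling point concrete, here is a minimal sketch, not a
measurement.  It assumes a collector that is some fixed factor more likely
to catch one category of spam than everything else; the function names and
the "capture ratio" parameter are hypothetical.  It shows that a 35% share
in the sample is consistent with a true share of 0.35%, 3.5%, or 35%,
depending on a bias the sample itself cannot reveal.

```python
def observed_share(true_share, capture_ratio):
    """Share of one category in the collected sample.

    true_share: fraction of all spam that belongs to the category.
    capture_ratio: how many times more likely the collector is to catch
        a message of that category than any other message (hypothetical).
    """
    caught = true_share * capture_ratio
    return caught / (caught + (1.0 - true_share))


def capture_ratio_needed(true_share, sample_share):
    """Capture ratio that makes true_share show up as sample_share."""
    return (sample_share * (1.0 - true_share)) / ((1.0 - sample_share) * true_share)


if __name__ == "__main__":
    for true_share in (0.0035, 0.035, 0.35):
        ratio = capture_ratio_needed(true_share, 0.35)
        print(f"true share {true_share:.2%}: capture ratio {ratio:.1f} "
              f"-> sample share {observed_share(true_share, ratio):.0%}")
```

The size of the sample never enters the calculation, so collecting more
messages with the same collector does not remove the ambiguity.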

The same applies to every spam corpus that has been mentioned in
this mailing list.  Despite its apparent skew from what I see as normal
spam (e.g. far too many virus/worms), those 89 messages are better
than some of the much larger collections, because more is known about
the nature of the target address(es?).

This should be about engineering.  That suggests that while it is
useful to know that some spam of a given kind exists, quantification is required.  Honest
quantification of spam characteristics cannot be supported by tiny or
even large ad hoc collections.  I don't know where to start to build
a representative sample of spam except by collecting a significant
fraction of the total.  That would amount to 100,000,000 spam/day.
Even that tactic would have the major problem of collecting only spam
that your sampling mechanisms recognize as spam.  If your machinery
assumes that all mail that carries a certificate is not spam, then
you might conclude certificates solve the spam problem even if Ralsky
had purchased a certificate, chip, or whatever and was pumping out
1,000,000,000 messages/day.
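
The recognition problem can be sketched the same way.  The traffic table
below is entirely hypothetical (the legitimate-mail figures are made up,
and the spam volumes merely reuse the round numbers above), and the
collector rule is the assumed one: anything with a certificate is never
counted as spam.  Under that assumption the collected corpus reports that
no spam carries a certificate, whatever the true share is.

```python
from collections import Counter


def recognized_as_spam(is_spam, has_certificate):
    # Assumed collector rule: anything carrying a certificate is never
    # counted as spam, whatever it actually is.
    return is_spam and not has_certificate


def certified_spam_shares(traffic):
    """Return (measured, true) share of spam that carries a certificate.

    traffic maps (is_spam, has_certificate) -> messages per day.
    """
    true_spam = sum(n for (spam, _cert), n in traffic.items() if spam)
    collected = {k: n for k, n in traffic.items() if recognized_as_spam(*k)}
    collected_total = sum(collected.values())
    collected_certified = sum(n for (_spam, cert), n in collected.items() if cert)
    measured = collected_certified / collected_total if collected_total else 0.0
    true = traffic[(True, True)] / true_spam if true_spam else 0.0
    return measured, true


# Hypothetical daily traffic, for illustration only.
TRAFFIC = Counter({
    (True, False): 100_000_000,     # ordinary spam
    (True, True): 1_000_000_000,    # certified spam from one large sender
    (False, False): 50_000_000,     # legitimate mail (made-up figure)
    (False, True): 5_000_000,       # legitimate certified mail (made-up figure)
})

if __name__ == "__main__":
    measured, true = certified_spam_shares(TRAFFIC)
    print(f"measured share of certified spam: {measured:.1%}")
    print(f"true share of certified spam:     {true:.1%}")
```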


 ...

Note that contrary to that document, it generally makes no sense to talk
about valid BCC fields in incoming legitimate mail.  See page 37 of
RFC 2822.  I wouldn't be surprised if SpamAssassin treats a BCC
header as evidence of evil, because I can't recall ever seeing one
except in spam.  More important and also contrary to that document,
RFC 822 and RFC 2822 do *NOT* require that every incoming message name
the recipient in any SMTP header.  Most legitimate mailing list
messages are cases in point.
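
A small illustration of the mailing list case, using Python's standard
email package.  The message text, the addresses, and the helper name are
all hypothetical.  The delivery address here stands for the SMTP envelope
recipient, which travels separately from the message headers.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical list message: the To: field names the list, not the subscriber.
RAW = """\
From: poster@example.org
To: asrg-list@example.org
Subject: test

body
"""


def header_names_recipient(raw_message, envelope_recipient):
    """True if the To:, Cc:, or Bcc: fields contain the delivery address."""
    msg = message_from_string(raw_message)
    fields = msg.get_all("To", []) + msg.get_all("Cc", []) + msg.get_all("Bcc", [])
    addresses = {addr.lower() for _name, addr in getaddresses(fields)}
    return envelope_recipient.lower() in addresses


if __name__ == "__main__":
    # A subscriber receives the message although no header names them.
    print(header_names_recipient(RAW, "subscriber@example.net"))
```

A check of this kind would flag ordinary list traffic, since the subscriber
who receives the message appears in none of the three fields.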

What discourages me most about this mailing list are self-described
expert statements such as "the RFC822 message standard requires every
message to have a valid To: CC: or BCC: field identifying the recipient,
making adjustment where necessary to account for messages relayed
through mailing lists."  There are obvious reasons why RFC 822 and
RFC 2822 do not require mailing list exploders to do anything of the sort.


Vernon Schryver    vjs(_at_)rhyolite(_dot_)com
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg