ietf-asrg
Re: [Asrg] spamstones

2003-04-01 15:34:32
From: bukys(_at_)cs(_dot_)rochester(_dot_)edu

...
Need HAM:

Since we can't define "spam," could we please avoid other cute words?
Every time I see "ham" I wonder if it was intended to mean "not-spam,"
"known to be valuable mail," "solicited bulk mail," or something else.
When "ham" is intended to mean "not-spam," please don't compound the
ambiguity by saying something other than "not-spam."

After we settle on definitions, the main missing ingredient is a good
HAM corpus attached to similarly-sampled SPAM.  Multi-language ham is
especially needed (I know the SpamAssassin team has issued a call for it,
but I don't know whether any will arrive).

Need SNAPSHOTS:

In addition, time-indexed snapshots of external sources of information
(RBLs, DCC, Razor, etc.) would be helpful as well.  Does anyone know
whether the operators of those retain any historical data?

I disagree.  Historical samples of spam (whatever that is) and not-spam
(ditto) are almost useless after 6 months except to historians unless
you have very modest filtering ambitions.  You need current and future
spam, not what Spamford was sending 7 years ago.

Anyone who wants current or historical spam samples (for spam=whatever
recipients think is spam) can check the Google archives of
news.admin.net-abuse.email or the nearest spool directory.

I have been accumulating some of both, but I'm not yet sure whether I'll
publish it, or in what form (blinded or clear).  Unfortunately I have very
little non-English ham, so my learning classifiers always lump all
Chinese, Portuguese, and Turkish text into the spam category.

Thanks to someone at an ISP using the DCC in a country where Spanish
is the native language, the DCC knows enough Spanish for its purposes.
I think anyone with similar problems should do the equivalent.
If you don't have a very good idiomatic command of the language, it
is difficult to be certain that a given message is probably unsolicited
bulk email or whatever you define as spam.

I don't see how the operators of DNS blacklists could have any samples
of spam except their own.  Only IP addresses go over the net from
clients to their systems.  A major design goal of the DCC is that
nothing that can be readily converted to mail contents is sent over
the wire, and so DCC users have only their own archives.  I assume
something similar applies to Razor/Pyzor/Cloudmark.  In theory you
might have better luck talking to Postini or Brightmail, but I bet in
practice they consider their archives proprietary as well as secret to
protect the privacy of spam targets and spam senders.
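
To illustrate why such operators see so little, here is a rough sketch of
what a client puts on the wire.  The first function builds the usual
DNS blacklist query name from an IPv4 address (octets reversed, then the
list's zone).  The second shows the general idea of sending only a
one-way digest of a message body.  The zone "dnsbl.example", both
function names, and the use of SHA-256 are assumptions made for
illustration only; the DCC uses its own checksums, not this one.

```python
import hashlib
import ipaddress


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name a client looks up for an IPv4 address.

    Only this name (the reversed address plus the list's zone) is sent
    to the blacklist; no part of the message is included.
    """
    # IPv4Address() validates the input and normalizes it.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def body_digest(body: bytes) -> str:
    """Return a one-way digest of a message body.

    SHA-256 is a stand-in chosen for this sketch; it is NOT the
    checksum the DCC actually computes.
    """
    return hashlib.sha256(body).hexdigest()


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address; dnsbl.example is hypothetical.
    print(dnsbl_query_name("192.0.2.1", "dnsbl.example"))
    print(body_digest(b"example message body\r\n"))
```

In either case the server receives an address or a digest, neither of
which can readily be turned back into the text of the mail.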


Vernon Schryver    vjs(_at_)rhyolite(_dot_)com
_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg


