procmail

Re: Again on spam with targeted meaningful text

2004-03-20 17:11:59
[I'm responding with findings on-list per Marco's request]
I ran the sample message through a few checks. The message consists of a lot of extraneous text relating to African issues, embedded in HTML as near-invisible (tiny, light-on-white) text. Its only spammy features are an embedded porn ad image and a framing href. The spam had apparently been carefully targeted at members of this list, so it contained "good bayes words" in abundance.

Spamassassin, in my configuration, didn't hit much:

--- spamassassin scoring ---
 0.1 HTML_MESSAGE           BODY: HTML included in message
 0.1 BIZ_TLD                URI: Contains a URL in the BIZ top-level domain
 1.5 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
                            [Blocked - see <http://www.spamcop.net/bl.shtml?68.126.135.169>]
 0.3 DNS_FROM_RFCI_DSN      RBL: From: sender listed in dsn.rfc-ignorant.org

Here's the URL:
<a href="http://1q-2w3e4r-4r5t6jab1y.biz/qa12ws-3ed4rf/b2900-12tj-02sec2/index.htm?butyl">
--- spamassassin scoring ---

Bogofilter actually flagged it as spam, again based on my training. Unfortunately, this was because the "good" text resembles the so-called "Nigerian scams" that are so prevalent.

--- bogofilter scoring ---
Bogofilter Report:
X-Spam-Bogosity: Yes, tests=bogofilter, spamicity=0.504700, version=0.17.2

   int  cnt   prob  spamicity histogram
[...]
"bigger"                             2  0.000000  0.007812  0.997090 +
"chief"                              2  0.000000  0.007812  0.997090 +
"directors"                          2  0.000000  0.007812  0.997090 +
"documentary"                        2  0.000000  0.007812  0.997090 +
"film"                               2  0.000000  0.007812  0.997090 +
"foster"                             2  0.000000  0.007812  0.997090 +
"officials"                          2  0.000000  0.007812  0.997090 +
"president"                          2  0.000000  0.007812  0.997090 +
"Africa"                             3  0.000000  0.011719  0.998056 +
"African"                            3  0.000000  0.011719  0.998056 +
"moving"                             3  0.000000  0.011719  0.998056 +
[...]
--- bogofilter scoring ---

Keep in mind, this is based on MY training of bayes. Yours presumably would not score such words as spam. Ifile (also bayes) seems to have tagged it too, but since the only spam in the body is the href and the image, it presumably did so, again, because of content in my message store that indicates spam.

--- ifile scoring ---
ifile Report:
/tmp/spamreport-msg.vDCSWU spam
spam -5462.30123663
ham -5764.29786015
--- ifile scoring ---
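
For anyone not used to reading ifile output, here is a tiny Python sketch of how I interpret the two numbers. It assumes they are log-style scores where the less negative folder wins, which matches the "spam" verdict on the first line of the report; I have not checked this against the ifile source.

```python
# Scores copied from the ifile report above.
scores = {"spam": -5462.30123663, "ham": -5764.29786015}

# Assumption: the folder with the highest (least negative) score wins.
best = max(scores, key=scores.get)

# How far ahead the winner is; a small margin would mean a close call.
margin = scores[best] - min(scores.values())

print(best, round(margin, 2))
```

On these numbers the margin is large, so for my message store this was not a borderline decision.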

Spamprobe, on the other hand, scored it as good:

--- spamprobe scoring ---
Spamprobe Report:
GOOD 0.0000000 2498472742d6ecd69aa1fe3518790d05
[...]
       Spam Prob   Count    Good    Spam  Word
[...]
       0.0000100       1      94       0  the session
       0.0000112       4      28       0  the au
       0.0000112       3      28       0  vice president
       0.0000101       2      31       0  how they
       0.0000121       1      26       0  space in
       0.0000131       1      24       0  to host
       0.0000149       1      21       0  society for
       0.0000196       1      16       0  a location
[...]
       0.9999492       1       0       5  csseditor
--- spamprobe scoring ---

So it successfully defeated bayes, or at least caused mine to score it for the wrong reasons. And it didn't have a LOT of spammy characteristics.

As this URL was embedded in the body, based on what I've been told here, those checks wouldn't gain much if re-implemented in procmail. If this were to become a persistent problem, I'd probably lean towards working up some additional spamassassin rules. For one thing, that URL is hardly typical, and existing "random letter" detection rules could be adapted. The message did have the "bayes-beating text" as embedded HTML, with tiny, near-invisible text. I don't run spamassassin rules to detect those, but presumably they'd help. Domain length checks might help too. The domain itself seems to have been randomly generated, so a fixed list of domains wouldn't be overly useful.
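
To make the "random letter" and domain length ideas concrete, here is a rough sketch in Python rather than as an actual spamassassin rule. The function names and both thresholds are my own inventions, picked by eye from this one sample, so treat them as placeholders and not as tuned values.

```python
from urllib.parse import urlsplit

# Hypothetical thresholds, chosen for illustration only.
MAX_LABEL_LEN = 15
MIN_DIGIT_LETTER_SWITCHES = 4

def host_label(url):
    """Return the label just left of the TLD, e.g. 'example' for http://example.biz/."""
    host = urlsplit(url).hostname or ""
    parts = host.split(".")
    return parts[-2] if len(parts) >= 2 else host

def looks_random(label):
    """Crude check: a very long label, or one that keeps flipping between digits and letters."""
    chars = [c for c in label if c.isalnum()]
    switches = sum(1 for a, b in zip(chars, chars[1:]) if a.isdigit() != b.isdigit())
    return len(label) > MAX_LABEL_LEN or switches >= MIN_DIGIT_LETTER_SWITCHES

url = "http://1q-2w3e4r-4r5t6jab1y.biz/qa12ws-3ed4rf/b2900-12tj-02sec2/index.htm?butyl"
print(host_label(url), looks_random(host_label(url)))
```

The sample domain trips both conditions, while something like the list's own host does not. A check this simple would certainly have false positives on legitimate long or numeric domains, so it belongs in a scoring system, not in a hard reject.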

So in short: I don't think bayes will be good at stopping these (or rather any "targeted spam" like this), but spamassassin cumulative scoring based on matching header AND body indicators would.
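
As a back-of-the-envelope illustration of the cumulative scoring point, the Python sketch below adds up the four hits from the report above and then adds two extra body rules. The two LOCAL_ rule names and their scores are hypothetical, made up for this example, and the 5.0 threshold is spamassassin's stock setting, which your configuration may override.

```python
# Scores copied from the spamassassin report above.
hits = {
    "HTML_MESSAGE": 0.1,
    "BIZ_TLD": 0.1,
    "RCVD_IN_BL_SPAMCOP_NET": 1.5,
    "DNS_FROM_RFCI_DSN": 0.3,
}

# Hypothetical extra rules and scores, invented for illustration only.
extra_hits = {
    "LOCAL_RANDOM_DOMAIN": 2.0,
    "LOCAL_TINY_LIGHT_TEXT": 1.5,
}

REQUIRED_SCORE = 5.0  # stock threshold; an assumption about the local setup

def total(*rule_sets):
    """Sum the scores of every rule that hit, rounded to one decimal."""
    return round(sum(s for rules in rule_sets for s in rules.values()), 1)

def is_spam(score):
    return score >= REQUIRED_SCORE

print(total(hits), is_spam(total(hits)))
print(total(hits, extra_hits), is_spam(total(hits, extra_hits)))
```

The existing hits alone fall well short of the threshold. With two modest body rules added, the same message crosses it, without any single rule having to carry the decision.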

Now, if it would be useful, simply stripping the offending content (mime-encoded) would keep the offensive ads out, though it wouldn't stop the spam in any way. That might be one of the layers of defense applied to "unknowns".
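
A minimal sketch of that stripping step, using Python's standard email package and not procmail itself. The function name and the choice to drop only image/ parts are my assumptions; it would do nothing for a message whose ad is merely linked from the HTML and not attached.

```python
from email import message_from_string

def strip_parts(msg, unwanted=("image/",)):
    """Drop MIME parts whose content type starts with an unwanted prefix.

    Single-part messages are returned unchanged.
    """
    if not msg.is_multipart():
        return msg
    kept = []
    for part in msg.get_payload():
        if part.is_multipart():
            # Recurse into nested multipart containers.
            kept.append(strip_parts(part, unwanted))
        elif not part.get_content_type().startswith(tuple(unwanted)):
            kept.append(part)
    msg.set_payload(kept)
    return msg
```

In a delivery pipeline this would sit in a small filter script that reads the message on stdin, calls strip_parts(), and writes msg.as_string() back out.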

- Bob


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
