procmail

RE: how to strip SA's message markup?

2004-02-22 09:56:48


Lukreme de la Kreme wrote:
Sent: Sunday, February 22, 2004 8:07 AM

On 21 Feb 2004, at 17:17, Gary Funck wrote:
I'd like to write a quick recipe for removing spamassassin's markup.
('spamassassin -d' does that, but when piped from formail on a 50,000
message corpus,

Why would you involve formail?  Doesn't spamassassin handle mboxes
anymore?


Nope. SA never did. SA accepts one mail message at a time. The other way to
do mass extractions is to run mass-check out of the 'masses' directory in
the SA distribution, and this will accept mboxes and save some overhead.
Still, 'mass-check' does other things as well, and is kind of a clunky
way to remove markup.

I also wanted the option of plugging this recipe back into my local
procmail recipe, to file a clean copy of spam in a different mbox, so
that I can more easily incorporate the spam mail into a corpus at a
later time.

 it will literally take all day.)

I recently fed over 250,000 messages through SA and it took less than
all day.


How did you do that without using something like formail to split out the
messages?

I just tried:
  spamassassin -d < spam.mbox > clean_spam.mbox
on an mbox with 20 (marked-up) spam messages. Although it did seem to remove
all the various SA artifacts, it left me with a somewhat discombobulated mbox,
which among other things contains only a *single* 'From_' line.
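
A minimal sketch of the per-message splitting discussed here, for
illustration only. It is not formail and not the recipe from this thread:
the function name filter_each_message is made up, and it assumes every
line starting with "From " begins a new message (that is, body lines are
already quoted as ">From "). Each From_ line is passed through untouched
and the rest of the message is piped to the given filter command.

```shell
# filter_each_message CMD [ARGS...]
# Reads an mbox on stdin. Each From_ line is copied to stdout as-is; the
# remainder of each message is piped through CMD, one process per message.
# Assumption: any line starting with "From " is a message separator.
filter_each_message() {
  awk -v cmd="$*" '
    /^From / {
      if (open) close(cmd)   # wait for the previous message to be filtered
      print                  # emit the separator line unchanged
      fflush()               # keep it ahead of the filter output
      open = 1
      next
    }
    open { print | cmd }     # headers and body go to the filter
    END  { if (open) close(cmd) }
  '
}
```

With a filter that reads one message on stdin and writes the cleaned
message to stdout, the output keeps one From_ line per message, e.g.
`filter_each_message tr a-z A-Z < spam.mbox` as a harmless dry run.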

I could write a Perl program, and invoke SA's methods directly, but
thought that procmail should be up for the task.

If all you want to do is delete the SA wrapper you simply have to
remove all the lines in the message up to the first ^Return-Path: line.
That removes the SA headers.  Then remove the final MIME boundary (it
will be the last line with text and will start with ------).


Yeah. That would've been easier. I just chose a slightly more general
method in the sed script. It only added three or four additional lines,
and is somewhat less sensitive to the details of SA's report attachment
format.

Once that is done, then you need to recreate the From_ line if you are using
mbox to store the mail.


I did this in the 'sed' script by simply copying the first (From_) line
off to the hold buffer where I kept the rest of the original message.
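
The simpler approach described above (keep the From_ line, drop everything
before the first Return-Path: line, drop the trailing MIME boundary) could
be sketched roughly as follows. This is not the sed script mentioned in
this thread; it uses awk instead, the function name strip_sa_wrapper is
made up, and it takes the description literally: it assumes the original
message starts at the first Return-Path: line and that the last non-blank
line is a boundary starting with "------".

```shell
# strip_sa_wrapper: reads ONE message on stdin, writes the unwrapped
# message to stdout. Assumptions (taken from the description above, not
# from SA documentation): the original message begins at the first
# "Return-Path:" line, and the last non-blank line is the closing
# boundary, starting with "------".
strip_sa_wrapper() {
  awk '
    NR == 1 && /^From / { print; next }    # keep the mbox From_ line
    !seen && /^Return-Path:/ { seen = 1 }  # original message starts here
    !seen { next }                         # drop the wrapper lines
    { buf[++n] = $0 }                      # hold the rest until EOF
    END {
      last = n
      while (last > 0 && buf[last] ~ /^[ \t]*$/) last--
      for (i = 1; i <= n; i++) {
        if (i == last && buf[i] ~ /^------/) continue  # final boundary
        print buf[i]
      }
    }
  '
}
```

A message with no Return-Path: line comes out as just its From_ line, so
this is only safe on messages known to carry the wrapper.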

However, if you have 50,000 messages tagged as SPAM by SA that are not
spam... well, that's a problem as well.

The spam mbox has a mix of marked up and non-marked up messages because
some are hand-filed when they're mis-scored as spam and mis-delivered
to the inbox. Good point though. In the case where the markup attachment
isn't present, I should still remove the X-Spam- headers.
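
For the no-attachment case, a minimal sketch of removing only the X-Spam-
header fields might look like this. It is an awk illustration rather than
a procmail recipe, and the function name strip_xspam_headers is made up.
It works on a single message, treats indented lines as continuations of
the preceding field, and leaves the body alone.

```shell
# strip_xspam_headers: reads ONE message on stdin and removes every
# header field whose name starts with "X-Spam-" (case-insensitive),
# including folded continuation lines. The body is copied unchanged.
strip_xspam_headers() {
  awk '
    inbody   { print; next }                 # past the header: copy as-is
    /^$/     { inbody = 1; print; next }     # first blank line ends header
    /^[ \t]/ { if (!drop) print; next }      # continuation follows its field
    {
      drop = (tolower($0) ~ /^x-spam-/)      # start of a new header field
      if (!drop) print
    }
  '
}
```

Lines in the body that happen to start with X-Spam- are kept, since only
the header block is examined.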

(The procmail recipe clicked off about 100 messages/sec. on a 2.4 GHz P4.)


