procmail

how to strip SA's message markup?

2004-02-21 17:27:09

Dallman,

I'd like to write a quick recipe for removing SpamAssassin's markup.
('spamassassin -d' does that, but when piped from formail over a
50,000-message corpus, it will literally take all day.) I could write a Perl
program and invoke SA's methods directly, but I thought procmail should be up
to the task.
I haven't done much MIME hacking in procmail, so I'm looking for a few
suggestions.

Here are the bits of interest:

From Nancedkatt(_at_)excite(_dot_)com  Fri Feb 20 13:51:57 2004
From: Englandbwse <Nancedkatt(_at_)excite(_dot_)com>
To: gary(_at_)intrepid(_dot_)intrepid(_dot_)com
Subject: Fast generic solution better than VIAAGRRA
Date: Fri, 20 Feb 2004 15:45:21 -0600
Content-Type: multipart/mixed; boundary="----------=_4036818D.011B2C5E"

------------=_4036818D.011B2C5E
Content-Type: text/plain
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
Content preview:  Dis.count Ph.armacy Onlin.e Sa.ve up t.o %8O orde.ring
Content analysis details:   (7.8 points, 5.0 required)

------------=_4036818D.011B2C5E
Content-Type: message/rfc822; x-spam-type=original
Content-Description: original message before SpamAssassin
Content-Disposition: attachment
Content-Transfer-Encoding: 8bit

From: Englandbwse <Nancedkatt(_at_)excite(_dot_)com>
To: gary(_at_)intrepid(_dot_)intrepid(_dot_)com
Subject: Fast generic solution better than VIAAGRRA
Date: Fri, 20 Feb 2004 15:45:21 -0600
Content-Type: text/html; charset=ascii-us
Content-Transfer-Encoding: 7Bit

------------=_4036818D.011B2C5E--


=======================

So, I'm guessing that I'll ignore MIME's continuation rules and hope that
each of those Content-* fields sits on a single line.  Would the recipe go
something like this?

SPACE=" "
# TAB must hold one literal tab character (archives tend to expand it)
TAB="	"
WS="$SPACE$TAB"

:0 B
* $ H ?? ^Content-Type:[$WS]+multipart/mixed;
* $ ^Content-Type:[$WS]+message/rfc822;[$WS]+x-spam-type=original
* $ ^Content-Description:[$WS]+original message before SpamAssassin
* $ ^Content-Disposition:[$WS]+attachment
{
# SA markup is present

# Pick up envelope From_
:0
* $ ^^\/From[$SPACE].*$
{ FROM_ = "$MATCH" }
# Extract MIME boundary
# (no '$' flag on this condition, so $WS would not be expanded; '.*' instead)
:0
* ^Content-Type:.*multipart/mixed;.*boundary="\/[^"]*
{ BOUNDARY = "$MATCH" }

# And at this point, I'll feed the body through 'sed' and pick off
# the second $BOUNDARY delimited part? Easiest/best way to do that?
# We also add $FROM_ at the beginning of the extracted message.
:0 bfw
| (echo "$FROM_"; sed ....)

# at this point, the body is the message, so we discard the header,
# by delivering the raw body?
:0 rb:
$DEFAULT

# I'm using $DEFAULT above, because it may be equated to '|', and
# I don't know if 'rb' will work the way I want when used as
# a filter.

}

:0 E:
$DEFAULT

---------------------------------------

It got a little more hairy than I thought it would, and I still have the
'sed' part to do ....
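
For the missing extraction step, here is one possible sketch. It swaps awk
in for sed, because the boundary string contains regex metacharacters ('.',
'=') and awk can compare whole lines as plain strings. The function name
extract_original is made up for illustration. It assumes FROM_ and BOUNDARY
are set as in the recipe above, that the body arrives on stdin, and that the
original message is always the second part.

```shell
#!/bin/sh
# Hypothetical helper, not part of SpamAssassin or procmail.
# Reads a multipart body on stdin; prints $FROM_ followed by the
# contents of the second "--$BOUNDARY" delimited part.
extract_original() {
    # Envelope line first, so the result is a valid mbox entry.
    printf '%s\n' "$FROM_"
    awk -v b="--$BOUNDARY" '
        # A delimiter ("--b") or closing delimiter ("--b--") starts a new part.
        $0 == b || $0 == (b "--") { part++; inhdr = 1; next }
        # Skip the MIME headers of the second part, up to the first blank line.
        part == 2 && inhdr { if ($0 == "") inhdr = 0; next }
        # Everything else in the second part is the original message.
        part == 2 { print }
    '
}
```

If this were saved as a script ending in a call to extract_original, the
filter action could pipe to that script in place of the sed command. The
blank line that precedes the closing delimiter is passed through, which
leaves the usual empty line at the end of the mbox entry. Because the value
is handed over with awk's -v option, a boundary containing backslashes would
be altered by awk's escape processing; the boundaries shown here have none.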


