procmail

Re: [0.0] matching words that are laced with html

2003-10-29 21:28:46
At 19:53 2003-10-29 -0500, Charles Gregory did say:

I started using SpamAssassin (www.spamassassin.org), which has built-in
functionality to defeat coding tricks like HTML and BASE64.....

And apparently it modifies the subject lines with stuff that wasn't originally there.

I've been considering suggesting new functionality for procmail (on procmail-dev, to which I'm not presently subscribed): internal support for BASE64, ordinal encoding, attachments, etc., depositing the decoded parts into separate body variables, with a couple of additional pseudo-arrays of variables containing MIME (or MIME-like) information about the attachments. Problem is, I don't have the time to get into the implementation details with anyone. One idea that stands out in my mind is a keyword that enables this processing: without it, procmail acts just like normal, but if you want MIME parsing, you set some variable that causes an aware version of procmail to do the additional parsing.

This would simplify a lot of what people currently have to handle the hard way with external programs (which are nice and modular to a point, but MIME, ordinals, quoted-printable, and BASE64 would be nice to see supported within procmail itself).
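For illustration, the "hard way" today looks roughly like the sketch below: test the headers, then hand the body to an external decoder and capture its output in a variable. This is untested and rests on assumptions of mine: that mimencode (from the metamail package) is installed, that the message is single-part, and that B_DECODED is just a name made up for the example.

```procmail
# Untested sketch.  Assumes mimencode (metamail) is on the PATH and the
# message is NOT multipart.  B_DECODED is a made-up variable name.
:0
* ^Content-Transfer-Encoding:.*base64
{
        # 'b' feeds only the body to the pipe; VAR=| captures its stdout.
        :0 b
        B_DECODED=| mimencode -u -b
}
```

Swapping "-b" for "-q" (with a matching condition on quoted-printable) should handle quoted-printable bodies the same way.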

In the meantime, if you want to strip HTML, try piping the message through lynx. See the following example.

# we work on a COPY of the message, and from that, set flags on how the
# ACTUAL message will be handled.  This is completely untested, because I
# don't happen to use it myself and I'm drafting it right here and now as
# I type.  Corrections are welcome.

        # Nifty trick for getting lynx to take input on stdin.
        # the formail invocation is to retain the body, but no headers.  If
        # I were in a clearer mindset right now, I could probably do much
        # better, but this is what springs to mind.
        # BE WARNED: your logfile will get huge and ugly if you're using
        # verbose, because this assignment of the ENTIRE MESSAGE BODY will be
        # logged.
        B_PLAIN=`formail -k -X "Null" -I "Null" | lynx -dump -nolist \
                -force_html -pseudo_inlines /dev/fd/0`

        # if the body was HTML (but not BASE64) encoded (or even if not <g>),
        # it should be in 'B_PLAIN' without any HTML comments or the like.
        # This also decodes ordinal encodings.  Embedded links (the actual
        # URLs referred to within the HTML, which may not be displayed AS the
        # anchor text) will disappear.

        # There is ABSOLUTELY NO consideration for multipart messages.

        # you can now simply do matches against B_PLAIN instead of B:
        :0:
        * B_PLAIN ?? (bank mortgage)
        spooge.mbx
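
A variation on the above, equally untested: instead of formail and /dev/fd/0, let procmail's 'b' recipe flag feed only the body to the pipe, and let lynx read it with -stdin. The sketch also turns verbose logging off around the capture and raises LINEBUF, since as far as I know captured output beyond LINEBUF gets truncated. The 32768 figure is an arbitrary guess of mine, not a recommendation.

```procmail
# Untested sketch.  32768 is an arbitrary value; the default is 2048.
LINEBUF=32768

# Keep the captured body out of the logfile.
VERBOSE=off

# 'b' = feed only the body; VAR=| captures stdout without delivering.
:0 b
B_PLAIN=| lynx -dump -nolist -force_html -stdin

# Only if you normally run verbose:
VERBOSE=on
```

The same caveat applies here: there is no consideration for multipart messages, and a BASE64-encoded body would still need decoding first.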

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

