
Parsing MIME headers (LONG) [Was: Re: Procmail ruleset: what is wrong?]

2000-05-14 11:10:44
David Collantes <david(_at_)bus(_dot_)ucf(_dot_)edu> writes:
...
Once again, what I want to do sounds simple, but I haven't been able to
make it work and no one has been able to come up with something that works
either. I am looking for a recipe to catch text/html or
multipart/mixed/alternative... in other words... anything that is not
plain text, add a banner at the beginning of the body and let it go through
normal delivery.

Here's the beginning of a solution.  It directly handles text/plain and
multipart/mixed content-types.  To handle text/html you'll need a filter
that can stick the data at the top of <BODY>, but otherwise that case is
easy to add below.  For other types it generates a new multipart/mixed
message with two parts, the first being the desired banner and the second
being the original content.  The only hard part of doing that is
generating a boundary string that doesn't occur in the original content.
Doing so is left as an exercise for the student.

Anyway, step one is to upgrade to procmail version 3.14 or later, then
download my rfc822 parsing rcfiles from
        ftp://www.gac.edu/pub/guenther/822rcs.tar.gz

and place them in a suitable location.  Read the included README file.
These are used to parse the Content-Type: header field and extract the
boundary string for multipart/mixed messages.

Then create the following rcfiles:

content-type-rc:
        :0
        * TYPE ?? ^^multipart/mixed^^
        {
            # param-rc will save the boundary string for the main rcfile
            SWITCHRC
        }

        # Unset TEXT to avoid parsing any content-type parameters
        TEXT

        :0 fhw
        * TYPE ?? ^^text/plain^^
        | cat - text-plain-warning-message
            
        # Uncomment the following if you have a filter to add a banner
        # to html
        # :0 Efbw
        # * TYPE ?? ^^text/html^^
        # | add-banner-to-html

        :0 E
        {
            # Wrap it in a new multipart/mixed
            SWITCHRC = /path/to/wrap-message
        }
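
For the commented-out text/html recipe above, here is a minimal sketch of
what a filter like add-banner-to-html might look like.  The function name,
the banner-file argument and the use of awk are my assumptions, not part of
the original recipes.  It also assumes the body is not base64 or otherwise
encoded and that the opening <BODY ...> tag sits on a single line; if no
such tag is found the body passes through unchanged.

```shell
# Hypothetical add-banner-to-html filter: body on stdin, result on stdout.
add_banner_to_html() {
    # $1 = file holding the banner as an HTML fragment (assumed name)
    BANNER_FILE=$1 awk '
        BEGIN { banner = ENVIRON["BANNER_FILE"] }
        # First line containing an opening <body ...> tag, in any case.
        !done && match(tolower($0), /<body[^>]*>/) {
            # Print everything up to and including the tag.
            print substr($0, 1, RSTART + RLENGTH - 1)
            # Copy the banner file right after the tag.
            while ((getline line < banner) > 0) print line
            close(banner)
            # Print whatever followed the tag on the same line.
            rest = substr($0, RSTART + RLENGTH)
            if (rest != "") print rest
            done = 1
            next
        }
        { print }
    '
}
```

Installed as a script, the file would end with a call such as
add_banner_to_html /path/to/html-banner so that the "fbw" recipe can pipe
the body through it.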

wrap-message:
        # Generate an original boundary string
        :0 b
        BOUNDARY=|generate-boundary-string

        # Save the Content-* headers for the nested body
        :0 h
        CONTENT_=|formail -XContent-

        # Create the new MIME headers
        :0 fh
        |formail -IContent- \
            -A"Content-Type: multipart/mixed; boundary=\"$BOUNDARY\""

        # Fixup the body
        :0 fb
        |echo "--$BOUNDARY"; \
           echo ""; cat text-plain-warning-message; echo ""; \
         echo "--$BOUNDARY"; \
           echo "$CONTENT_"; echo ""; cat -; echo ""; \
         echo "--$BOUNDARY--"
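
The generate-boundary-string command used above is not supplied.  One
possible sketch, under my own assumptions: spool the body that procmail
feeds on stdin, then try candidate strings until one does not occur in it.
The "=_banner_" prefix, the use of the shell's PID and the reliance on
mktemp are all illustrative choices, not anything the recipes require.

```shell
# Hypothetical body for generate-boundary-string: message body on stdin,
# one boundary string on stdout.
generate_boundary() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"                      # spool the body from stdin
    n=0
    # grep -F compares fixed strings, so regexp specials cannot confuse it.
    while grep -F -q -e "=_banner_$$_$n" "$tmp"; do
        n=$((n + 1))                  # candidate occurs in the body: try the next
    done
    rm -f "$tmp"
    printf '%s\n' "=_banner_$$_$n"
}
```

Installed as a script, the file would end with a bare generate_boundary
call, so that the "BOUNDARY=|generate-boundary-string" recipe captures the
printed line.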


param-rc:
        :0
        * PARAM ?? ^^boundary^^
        {
            # Okay, we have the boundary string.  Insert the warning and
            # a copy of the boundary after the first instance of the
            # boundary in the message.  This could be done in sed or awk
            # with enough work, but it's easier in perl with the \Q...\E
            # (quotemeta) operator to protect us from regexp specials in
            # the boundary string.
            # Unset SHELLMETAS to save a shell here.
            oSM=$SHELLMETAS SHELLMETAS
            :0 fb
            | perl -pe \
                 'if (/^--\Q$ENV{VALUE}\E$/) { \
                    print "$_\n"; \
                    open(B,"text-plain-warning-message"); \
                    while(<B>){print} \
                    close(B); \
                    print "--$ENV{VALUE}\n"; \
                    while (<>) {print} \
                    last; \
                  }'
            SHELLMETAS=$oSM

            TEXT
        }
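
For comparison with the perl filter above, the same insertion can be
sketched in awk by comparing each line to the delimiter as a plain string,
which sidesteps regexp quoting entirely.  This is only a sketch: the
function name and its two arguments are hypothetical, and the values are
passed through the environment so that awk does not interpret backslashes
in them.

```shell
# Hypothetical awk equivalent of the perl filter: body on stdin.
insert_banner_after_boundary() {
    # $1 = boundary string (what param-rc sees as $VALUE), $2 = banner file
    BOUNDARY_VALUE=$1 BANNER_FILE=$2 awk '
        BEGIN {
            delim  = "--" ENVIRON["BOUNDARY_VALUE"]
            banner = ENVIRON["BANNER_FILE"]
        }
        # Plain string comparison, so no quotemeta step is needed.
        !done && $0 == delim {
            print delim      # original delimiter now opens the banner part
            print ""         # empty header block: part defaults to text/plain
            while ((getline line < banner) > 0) print line
            close(banner)
            print delim      # second delimiter starts the original first part
            done = 1
            next
        }
        { print }
    '
}
```

Like the perl version, it only acts on the first line that is exactly
"--" followed by the boundary; everything else is copied through untouched.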

error-rc:
        # There was a syntax error in the Content-Type: header.  Shoot.
        # Currently, the 822 parsing rcfiles cannot recover from errors:
        # you can't unwind the stack enough.  The only real workaround
        # I can think of right now is to add a "do not process" header
        # field and resend the message to the recipient.  At least we can
        # wrap the message beforehand.
        #
        #       I did say "workaround"!
        #
        INCLUDERC = /path/to/wrap-message
        :0 fh
        | formail -I"Dont-Add-Banner: yes"

        # Handle plus address correctly and save the envelope sender
        :0
        * ^Return-Path:\/.*
        ! -f "$MATCH" -- "$LOGNAME${1:++$1}"



Finally, put the following assignments and recipe in your main rcfile:

        :0
        * ^Mime-Version:[       ]*1\.0
        * ^Content-Type:\/.*
        {
            # Avoid looping if there was a previous parse error
            :0
            * ! ^Dont-Add-Banner: *yes
            {
                # Okay, it's MIME.  Setup and invoke the rfc822 routines
                # MATCH holds the field body captured by \/ above
                TEXT = $MATCH
                contenttyperc = /path/to/content-type-rc
                paramrc = /path/to/param-rc
                ERRORRC = /path/to/error-rc
                _rcfileprefix = /path/to/rfc822rcfiles/822
                INCLUDERC = ${_rcfileprefix}content-type
            }
        }
        :0 Efhw
        | cat - text-plain-warning-message



At this point I figure that I've either confused you completely or
made you want to vomit.  Does it really have to be this complicated?
Well, yes and no.  Parsing mail headers and MIME bodies Is Not Simple.
There is no way to correctly do it with just regexps, so you must either
use recursive rcfiles or farm it out to a program written in C, Perl,
Python, whatever.  Farming it out is almost certainly simpler for some
of the cases.  For example, you could do a quick check to see if the
Content-Type: header field either:

a) is commentless through the type/subtype part and specifies a type
   besides multipart/mixed, OR
b) specifies a type of multipart/mixed, contains no comments at all, and
   the boundary parameter has no quoted-pairs ("blah\"blah")

In case (a) you can either tack on the banner, deal with the HTML, or
wrap it.  In case (b) you can insert the leading banner with the block
of perl seen above in param-rc.
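
The quick check for cases (a) and (b) could be sketched as a small shell
function.  Everything here is an assumption for illustration: the function
name, the three result words, and the requirement that the caller passes
the already unfolded Content-Type: field body as one argument.  It is
deliberately conservative, treating any "(" or backslash in a
multipart/mixed header as a reason to hand the message to a real parser.

```shell
# Hypothetical classifier for a Content-Type: field body.
# Prints other-simple (case a), multipart-simple (case b) or complex.
classify_content_type() {
    # Lowercase the value and drop leading whitespace.
    value=$(printf '%s\n' "$1" | tr 'A-Z' 'a-z' | sed 's/^[[:space:]]*//')
    # Everything before the first whitespace, ";" or "(" should be type/subtype.
    type=$(printf '%s\n' "$value" | sed 's/[[:space:];(].*$//')
    case $type in
        ''|/*|*/|*/*/*)
            echo complex ;;              # comment or junk inside type/subtype
        multipart/mixed)
            case $value in
                *'('*|*'\'*) echo complex ;;          # comment or quoted-pair
                *)           echo multipart-simple ;; # case (b)
            esac ;;
        */*)
            echo other-simple ;;         # case (a)
        *)
            echo complex ;;              # no subtype at all
    esac
}
```

For example, classify_content_type ' text/html; charset=us-ascii' prints
other-simple, while a boundary written as "a\"b" makes the function print
complex.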


In other words, the partial solution presented above is probably *not*
what you want to do.  Handling the simple cases in procmail and then
farming out the difficult ones to perl is a much better idea.  However,
I'm too tired to rewrite it and it may give you ideas.  Whoever it was
that asked about looping in procmail can try reading the rfc822 rcfiles
to see how I do it there.


Good Luck...


Philip Guenther
