procmail

Re: MIME attachment killer won't work

2000-01-14 21:56:52

W. Mark Herrick <markh(_at_)va(_dot_)rr(_dot_)com> asked a few days ago:

> On a related note, is there any way to do the following:
>
> If MIME (or any, for that matter) attachment, then
>   a. Drop MIME part
>   b. Autorespond to sender telling them that MIME/any attachment was
>      dropped
>
> This would be useful.

Well, when it comes to MIME anyway, it can be done in procmail (with
a few helpers like sed), but it can be a little tricky. Here are the
problems, and some hints if you want to proceed anyway.

To start, let's define some terms and clear up just what
you want to do. We'll deal with part (a) of your request here;
part (b), the autoreply, has been covered many times in the list archive
<http://www.rosat.mpe-garching.mpg.de/mailing-lists/procmail/>, q.v.

The first problem is identifying attachments. "Simple!" you say. Locate
Content-Disposition: attachment and its variants. That works most of the
time, but not when the phrase is part of a discussion on MIME deletion,
such as in this paragraph. Cute trick; let's go on.
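
To make the false positive concrete, here is a minimal shell sketch,
assuming a POSIX sh with grep and awk. The function names naive_match
and part_header_match are invented for illustration and are not part of
procmail. The first matches the phrase anywhere in the message; the
second accepts it only inside the run of non-blank lines that follows a
line starting with "--", which is roughly the shape of a MIME part
header. It is still a heuristic, and a message that quotes a whole MIME
part flush left would fool it.

```sh
#!/bin/sh
# naive_match: succeed if the phrase appears anywhere in the message on stdin.
naive_match() {
    grep -qi 'content-disposition:.*attachment'
}

# part_header_match: succeed only if the disposition line sits in the run of
# non-blank lines right after a line starting with "--" (a likely part header).
part_header_match() {
    awk '
        /^--/ { inhead = 1; next }    # a boundary-like line opens a part header
        /^$/  { inhead = 0; next }    # a blank line closes it
        inhead && tolower($0) ~ /^content-disposition:[ \t]*attachment/ { found = 1 }
        END   { exit !found }         # exit status 0 only when a hit was seen
    '
}
```

Fed a message whose body merely talks about the header, naive_match
succeeds and part_header_match fails, which is the distinction the
recipes below try to draw inside procmail.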

Once we have identified a part, the question is how we go about
deleting it. The excision itself is pretty simple, and I'll provide a
GNU sed filter script to do that part. The more interesting question is:
how do we ensure that what remains is still valid? The good news is that
we don't have to do much at all, although we do have to deal with a few
different situations. From my own limited testing, some MUAs (elm and
pine) have problems with malformed MIME messages. Some (mailx) don't. I
would guess that you want to create few, if any, problems.

So let's examine some cases.

 1. What if the whole message is an attachment (not multipart, but
    Content-Disposition: attachment in the RFC822 headers)?

 2. (Likely) What if there are multiple parts, one of which is an
    attachment?

 3. What if there is only one part (the attachment) in a MIME multipart
    message?

 4. What if the attachment is part of a nested part?

 5. (Also likely) What if there are multiple attachments?

Here's how we'll deal with each.

 1. Detect this with

    :0
    * ^content-disposition:[  ]*attachment($|;| |  )
    { action }

    The stuff after 'attachment' ensures that the word isn't part of a
    filename. Fortunately for us, RFC2045 forbids RFC822-style comments
    in content-disposition headers. The disposition-type should follow
    the header name directly, per RFC, although you can code this more
    defensively if you like.

    Removing involves dropping the body and removing the MIME indicia
    from the header. We can try to grab the filename to use in our
    autoreply, and we can insert a dummy body in the message so that
    there is something there when we look at it. We'll also set a flag
    for autoreplying that we can check later. Putting these together:

    :0
    * ^content-disposition:[  ]*attachment($|;| |  )
    * ^content-.*filename="?\/[^";]+
    { file=$MATCH   sendNoMIME=yes
      :0 f h w
      | formail -imime-v -icontent-d -icontent-t -icontent-m -icontent-b \
                -A"X-Munged: removed attachment $file from message"
      :0 f b w i
      | echo "$file was here. It is gone."
    }

    Note: Where there are two spaces in character class square brackets,
    I mean space and tab.

 2. (One part in a 2+ part multipart)

    This and case 5 are the most likely occurrences. Detect this one
    thus:

    :0 B       ## first, look for a multipart with one or more attachments
    * H ?? ^content-type:.*multipart/
    * content-disposition:.*attachment($|;| |  )
    { sendNoMIME=yes
      :0 B          ## next, grab the part header for the first attachment
      * ^\/--(.+$)+content-disposition:[  ]*attachment.*$(.+$)*
      { partHead=$MATCH  file=NameNotFound
        :0                       ## grab the filename for use in reporting
        * partHead ?? filename="?\/[^";]+
        { file=$MATCH }
        :0                          ## grab the boundary for use by filter
        * partHead ?? ^^--\/.+
        { boundary=$MATCH }
        :0                  ## grab the disposition line for use by filter
        * partHead ?? ^\/content-disposition.+
        { cdisp=$MATCH }
        :0 f b w
        | sed -n -e ":top;\\?^--$boundary?!{p;n;btop;}"     \
         -e ":hold;\\?^--$boundary--?{p;n;btop;}"           \
         -e "h;:head;n;\\?$cdisp?brepl;/./{H;bhead;}"       \
         -e "H;x;p;:print;n;\\?^--$boundary?bhold;p;bprint" \
         -e ':repl;x;P;a\
Content-Type: text/plain; charset=US-ASCII\
 \
Part dumped\
'        -e ":dump;n;\\?^--$boundary?bhold;bdump"
        :0 f h w
        | formail -A"X-Munged: removed attachment $file from message"
      }
    }

 3. If there was only one part (the attachment), we're OK. The recipe
    above has replaced that part with another valid part, so the headers
    should all be fine.

 4. If the message is nested, the multipart section may not be mentioned
    at all in the RFC822 headers of the original message. This means
    that the recipe in (2) won't detect it. A simple change handles
    that. Replace the first condition line of the recipe (H ??...) with
    this new line:

    * HB ?? ^content-type:.*multipart/\/[^  ;]+

    and it should work on all nested messages.

 5. Finally, how do we handle messages with multiple attachments? This
    requires a set of recursive recipes. The first recipe, in the
    calling rc file, does the initial detection, and the remaining
    recipes, in a separate rc file, do the bulk of the work.
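
Before wiring the patterns from cases 1 and 2 into a live rc file, it
can help to see what they would capture from a given message. The
following is a rough shell equivalent, not the procmail matching itself:
a sketch assuming POSIX sed and awk, with made-up function names
(first_attachment_head, part_boundary, attachment_filename).

```sh
#!/bin/sh
# first_attachment_head: print the boundary line and part header of the first
# part whose header has a "Content-Disposition: attachment" line (stdin).
first_attachment_head() {
    awk '
        /^--/ { head = $0; inhead = 1; hit = 0; next }   # start collecting at a boundary
        inhead && /^$/ {                                 # blank line ends the part header
            if (hit) { print head; exit }
            inhead = 0; next
        }
        inhead {
            head = head "\n" $0
            if (tolower($0) ~ /^content-disposition:[ \t]*attachment/) hit = 1
        }
    '
}

# part_boundary: strip the leading "--" from the first line of a part header.
part_boundary() {
    sed -n '1s/^--//p'
}

# attachment_filename: pull the filename parameter from a part header on stdin,
# roughly what  filename="?\/[^";]+  captures; print NameNotFound otherwise.
attachment_filename() {
    name=$(sed -n 's/.*[Ff][Ii][Ll][Ee][Nn][Aa][Mm][Ee]="\{0,1\}\([^";]\{1,\}\).*/\1/p' | head -n 1)
    printf '%s\n' "${name:-NameNotFound}"
}
```

Running a saved message through first_attachment_head and then piping
the result to the other two functions shows the boundary and file name
that the recipes would be expected to pick up.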
    
Putting it all together, this may be what you asked for.

In the main file (usually .procmailrc, YMMV):

    :0
    * ^content-disposition:.*attachment($|;| |  )
    * ^content-.*filename="?\/[^";]+
    { file=$MATCH   sendNoMIME=yes
      :0 f h w
      | formail -imime-v -icontent-d -icontent-t -icontent-m -icontent-b
      :0 f b w i
      | echo "$file was here. It is gone."
    }
    :0 E B ## otherwise, look for a multipart with one or more attachments
    * HB ?? ^content-type:.*multipart/
    * content-disposition:.*attachment($|;| |  )
    { sendNoMIME=yes
      ## clear the name list once, here, so the recursion can append to it
      files
      INCLUDERC=ReplacePart.rc
      :0 f h w
      | formail -A"X-Munged: removed attachment(s) $files from message"
    }
    :0
    * sendNoMIME ?? yes
    * other autoresponder conditions
    { autoresponder recipes }

And in the rc file ReplacePart.rc:

    :0 B            ## next, grab the part header for the first attachment
    * ^\/--(.+$)+content-disposition:.*attachment[;  ]*$(.+$)*
    { partHead=$MATCH
      :0         ## grab the filename for use in reporting, append to list
      * partHead ?? filename="?\/[^";]+
      { files=${files:+$files, }$MATCH }
      :0                            ## grab the boundary for use by filter
      * partHead ?? ^^--\/.+
      { boundary=$MATCH }
      :0                    ## grab the disposition line for use by filter
      * partHead ?? ^\/content-disposition.+
      { cdisp=$MATCH }
      :0 f b w            ## filter to replace attachment by fixed message
      | sed -n -e ":top;\\?^--$boundary?!{p;n;btop;}"     \
       -e ":hold;\\?^--$boundary--?{p;n;btop;}"           \
       -e "h;:head;n;\\?$cdisp?brepl;/./{H;bhead;}"       \
       -e "H;x;p;:print;n;\\?^--$boundary?bhold;p;bprint" \
       -e ':repl;x;P;a\
Content-Type: text/plain; charset=US-ASCII\
 \
Part dumped\
'      -e ":dump;n;\\?^--$boundary?bhold;bdump"
      :0 a B      ## look for more attachments in body iff previous worked
      * content-disposition:.*attachment($|;| |  )
      { INCLUDERC=$_ }
    }
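
For trying recipes like these without touching real mail, a controlled
message is handy. The sketch below assumes a POSIX sh; the function
names make_test_message and count_attachments, the boundary string
testbound, and the example.com addresses are all placeholders invented
here.

```sh
#!/bin/sh
# make_test_message: write a small multipart message with one text part and
# two attachment parts to stdout. Addresses and names are placeholders.
make_test_message() {
    cat <<'EOF'
From sender@example.com Fri Jan 14 21:56:52 2000
From: sender@example.com
To: recipient@example.com
Subject: attachment test
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="testbound"

--testbound
Content-Type: text/plain; charset=US-ASCII

Body text that should survive.
--testbound
Content-Type: text/plain; charset=US-ASCII
Content-Disposition: attachment; filename="one.txt"

first attachment
--testbound
Content-Type: text/plain; charset=US-ASCII
Content-Disposition: attachment; filename="two.txt"

second attachment
--testbound--
EOF
}

# count_attachments: number of attachment disposition lines on stdin.
count_attachments() {
    grep -ci '^content-disposition:[[:space:]]*attachment'
}
```

Assuming your procmail supports the -m (mail filter) option, something
like "make_test_message | procmail -m ./test.rc", with the recipes in a
scratch test.rc that delivers to a throwaway file, should exercise them;
count_attachments on the result then shows how many parts were left.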

I leave the eradication of non-MIME attachments as an exercise.

Some notes on what these recipes do, how they do it, what they don't
do, and what they can't do:

 1. They haven't been tested in production. The recipe for point 2,
    above, has been pasted into a test harness and fed controlled
    messages, and works. (Be sure to put tab characters into the right
    places, though).

 2. The non-multipart recipe (point 1) requires filename=value in a
    content header. The multipart recipes assume that filename=value
    is present in the MIME part header; if it isn't, the recipe should
    still work, but the reporting may look strange.

    Note that, unlike content-disposition headers, content-type headers
    can have comments. This means that there is a small chance that the
    multipart test at the top of this recipe set will suffer a
    false positive. C'est la vie.

 3. The formail call with all the content headers listed can be
    simplified if you know that there aren't any content-length headers,
    or are willing to delete one if it is there.

 4. The recipe in point 2 works as follows.

    . The top recipe checks whether there is something in the message worth
      replacing, that is, something is defined in the head or body as
      multipart, and the body has at least one content disposition
      header. This doesn't guarantee that there is something there to
      replace, but it does say that it's worth looking.

    . The next test is the big one. It catches a content-disposition
      header along with the MIME part header lines which precede it,
      and the boundary line before those. If we can't find something to
      match this, the first test was a false positive.

    . On a hit, we parse the boundary, disposition header, and filename
      for later use. We then pass the boundary and disposition header to
      sed for the replacement action.

    . Sed is used to filter the message body, replacing each part which
      matches this target disposition line. It doesn't look for the
      particular part which the procmail regexp isolated, simply any
      part which has the same disposition line.

    . The sed filter may catch more than one attachment, in which case
      the file names are underreported.

 5. The sed routine may well need fiddling for your version of sed. Mine
    is GNU sed, and this works with it. Many seds don't abide anything
    after a label in a -e expression. Many don't want anything before an
    opening curly brace. On one sed I tried, I needed 19 expressions
    to do this. Part of that is my sed skill, part is the sed version.

    Here is a walkthrough of the sed script:

    . If we haven't hit a boundary yet, print the current line and go
      back for more. (Print everything from the start to the first
      boundary, but not the first boundary.)

    . If we hit the final boundary, print it out and go back for more.
      (Print everything from the final boundary to the end of the
      message.)

    . We hit a non-terminal boundary. Put it into the hold, clearing out
      whatever might be there.

    . Process the MIME part head: Get the next line. If it is our target
      disposition line, branch to the replace routine. If it isn't a
      blank line (it's another head line), append it to the hold and go
      back for more head lines. Otherwise (it's a blank line) we've
      reached the end of the part header and this one isn't the one
      we're after, so:

    . Append the blank line to the hold, then retrieve and print the
      hold, then print every line until we get to another boundary line.
      When we get to a boundary, branch back to the boundary tests.

    . To replace the target part, first retrieve the hold so we can
      print the boundary line. Print just the boundary line. Then append
      an appropriate header, a blank line, and a fixed message. Then
      loop through more lines until we hit another boundary, and branch
      back to the boundary tests.

    What this sed script can't do:

    . Insert variable data in place of the removed part. I couldn't
      find a way to insert the file name or boundary variable, so I
      resorted to popping the boundary variable off the top of the hold
      and inserting constant text. Blank lines are appended with space
      backslash, I learned through experimentation and rapid hair loss.

    What this sed script doesn't do:

    . Save the attachment to a file. Actually, you could use sed w
      (write) commands to append the mime part header and body to a file
      if you want, but then you have to deal with generating unique
      names and the (unproven) cost of those individual writes. You also
      have to deal with the chance that the script will replace more
      than one part, and the complications that introduces for saving
      them.

 6. The recipes cannot catch defaulted MIME parts (parts with no MIME
    part headers after the boundary line). That isn't a problem here,
    since attachments require some of those headers.
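
If your sed won't cooperate, the steps in the walkthrough under note 5
translate fairly directly to awk. What follows is a swapped-in
technique, not the sed filter from the recipes: a sketch assuming a
POSIX awk, with a made-up function name (replace_part) that takes the
boundary and the disposition line as arguments. Unlike the sed version
it compares the disposition line as an exact string rather than as a
regexp, it writes a truly empty line after the replacement header, and
awk -v will mangle arguments that contain backslashes.

```sh
#!/bin/sh
# replace_part BOUNDARY CDISP: filter a MIME body on stdin, replacing every
# part whose header contains the exact line CDISP with a short text part.
replace_part() {
    awk -v b="--$1" -v cdisp="$2" '
        function release() { if (held != "") printf "%s", held; held = "" }
        # Final boundary: print anything held, print the line, resume copying.
        index($0, b "--") == 1 { release(); state = "copy"; print; next }
        # Ordinary boundary: hold it and start reading a part header.
        index($0, b) == 1 { release(); held = $0 "\n"; state = "head"; next }
        state == "head" {
            if ($0 == cdisp) {          # target part: emit the replacement
                split(held, h, "\n"); print h[1]
                print "Content-Type: text/plain; charset=US-ASCII"
                print ""
                print "Part dumped"
                held = ""; state = "dump"
            } else if ($0 == "") {      # end of a header we are keeping
                release(); print; state = "copy"
            } else held = held $0 "\n"  # another header line, keep holding
            next
        }
        state == "dump" { next }        # rest of the removed part
        { print }                       # preamble, kept bodies, epilogue
        END { release() }
    '
}
```

In a recipe this would stand where the sed command does, along the lines
of | replace_part "$boundary" "$cdisp", provided the function is
available to the shell that procmail starts (for instance as a small
script on the PATH).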

Now that the question is answered, howzabout some discussion:

First, of course, is the question: "Why?" Why do you want to delete all
attachments? Many MUAs handle them quite well. Is there a problem with
attachments like this one:

    --boundaryline
    Content-Type: text/plain; charset=US-ASCII
    Content-Disposition: attachment;
        filename="stage2.txt"
    Content-Transfer-Encoding: base64

    QWxnb3JpdGhtOiBDYWVzYXIgQ2lwaGVyDQpPZmZzZXQ6IDE5DQoNCkNpcGhl
    cnRleHQ6DQpNSElMWSBMWkEgWkJITCBYQlBaWEJMIE1WWUFCVUhMIEhXV0FQ
    QlogSlNIQktQQlogSkhMSkJaIEtQSkFCVCBIWUpIVUJUIExaQQ0KVUxCQVlW
    VQ0KDQpQbGFpbnRleHQ6DQpGQUJFUiBFU1QgU1VBRSBRVUlTUVVFIEZPUlRV
    TkFFIEFQUFRJVVMgQ0xBVURJVVMgQ0FFQ1VTIERJQ1RVTSBBUkNBTlVNIEVT
    VA0KTkVVVFJPTg0K
    --boundaryline

If the problem is the size of the messages, you can more easily bounce
all large messages.

Maybe your complaint is about vcards and tnef and the like. Those can be
removed easily, to be sure, and many of us do that. You don't, however,
need a big stick to do that, and this is a big stick.

Anyway, that's how I might go about silencing the MIME, should I need
to. I'm sure that others on the list will have some comments. As a
matter of fact, I wouldn't try to implement this until a few of our
éminences grises have had their say.

-- 
Rik Kabel     Creating tomorrow's legacy systems, today     
rik(_at_)netcom(_dot_)com