
Re: perl script to remail contents of mbox file

2004-06-10 20:37:51
At 14:41 2004-06-10 -0400, Jeff A. Earickson wrote:

be used to recover from procmail rule disasters, or it can be used to
forward already-delivered email to another address.  We hope that
procmail sysadmins find this script useful.

I believe you meant to say that you hosed a rule in /etc/procmailrc, not "my procmail rules" (since redelivery to YOURSELF is an ultra-trivial matter). You have a problem in that the messages as stored won't contain any information positively identifying who they were intended for. If you rely on the To: or a "Received: ... for" header giving you this information, you're going to be subject to a LOT of false information, and will very likely end up pissing off a lot of people (including people on mailing lists to which you may redeliver messages). Multiple recipients and BCCs in particular are issues.

If you need to recover from THAT sort of catastrophe, I'd suggest leaning on the system maillog as a resource for sorting who a message was originally intended for.

        formail -s procmail -m i_fscked_up.rc < thrashedmail.mbx

(noting that thrashedmail.mbx should NOT be the mailbox located in the same location that your /etc/procmailrc (or any other procmailrc for that matter) would be delivering mail into, lest you generate yourself an endless loop.)

The i_fscked_up.rc would:

* take the topmost received header and extract the (E)SMTP message ID from it. For example, the message to which I am replying arrived at my host with the following topmost received header:

Received: from ms-dienst.rz.rwth-aachen.de (ms-1.rz.RWTH-Aachen.DE [134.130.3.130])
        by mailhost.domain.tld (8.12.10/8.12.10) with ESMTP id i5AIrbh9015749
        for <my_address(_at_)some(_dot_)domain(_dot_)tld>; Thu, 10 Jun 2004 11:53:38 -0700

That "i5AIrbh9015749" bit is of interest to us.

You could use formail to get this header, but you'll still need to process it further to get the SMTP ID from it, and since the topmost Received header should be locally inserted AND contain your local mailhost name, the following expression should grab it handily:

:0
* ^Received:.*by mailhost\.domain\.tld.* with E?SMTP id \/[a-z0-9]+
{
        SMTP_ID=$MATCH
}
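
If you want to sanity-check that pattern outside of procmail before trusting it, a rough shell equivalent is sketched below. It runs sed over the sample header quoted above (unfolded onto one line); the variable names are only for illustration, and mailhost.domain.tld is the same placeholder hostname as in the recipe.

```shell
# Sample topmost Received: header from above, unfolded onto a single line.
received='Received: from ms-dienst.rz.rwth-aachen.de (ms-1.rz.RWTH-Aachen.DE [134.130.3.130]) by mailhost.domain.tld (8.12.10/8.12.10) with ESMTP id i5AIrbh9015749 for <my_address(_at_)some(_dot_)domain(_dot_)tld>; Thu, 10 Jun 2004 11:53:38 -0700'

# Keep only the token that follows "with ESMTP id" (or "with SMTP id").
smtp_id=$(printf '%s\n' "$received" \
  | sed -n 's/^Received:.*by mailhost\.domain\.tld.* with E\{0,1\}SMTP id \([A-Za-z0-9]*\).*$/\1/p')

echo "$smtp_id"
```

If this prints nothing for your own headers, the procmail condition most likely won't match either, and the hostname or the "with ... id" wording needs adjusting for your MTA.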


Using that ID, the recipe would then grep your maillog file. Optimally, you might pre-process your maillog to strip it down to smtpids and their related local recipients (which could be an external process), and thus finding a list of recipients would be a simple one-line grep operation. We won't assume that's been done, and instead will simply do all the work here. Note this is NOT going to be a light CPU load. Then again, someone fscked up, and this shouldn't need to be run more than the one big time, right?
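
For what it's worth, the optional pre-processing step could look something like the sketch below. The log lines are abbreviated, made-up samples in sendmail's format (the example.* addresses and the second SMTP id are hypothetical), and the index is just "smtpid recipient" pairs in a temp file. It assumes one address per to= field.

```shell
# Hypothetical, abbreviated maillog sample (normally /var/log/maillog).
maillog=$(mktemp)
index=$(mktemp)
cat > "$maillog" <<'EOF'
Jun 10 11:53:38 trei sm-mta[15749]: i5AIrbh9015749: from=<sender@example.org>, size=10158, nrcpts=1, proto=ESMTP
Jun 10 11:53:43 trei sm-mta[15750]: i5AIrbh9015749: to=<alice@example.com>, delay=00:00:05, mailer=local, stat=Sent
Jun 10 11:54:02 trei sm-mta[15760]: i5AIs1xx015760: to=<bob@example.net>, delay=00:00:03, mailer=esmtp, stat=Sent
EOF

# Reduce the log to "smtpid recipient" pairs, local deliveries only.
sed -n 's/^.*]: \([A-Za-z0-9]*\): to=<\([^>]*\)>,.* mailer=local.*$/\1 \2/p' \
  "$maillog" > "$index"

# Looking up the recipients for one message is then a one-line grep.
lookup=$(grep "^i5AIrbh9015749 " "$index" | cut -d' ' -f2)
echo "$lookup"

index_contents=$(cat "$index")
rm -f "$maillog" "$index"
```

The rest of this message does NOT assume such an index exists.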

A simple greppage for this ESMTP id would return something like:

Jun 10 11:53:38 trei sm-mta[15749]: i5AIrbh9015749: from=<procmail-bounces(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE>, size=10158, class=-30, nrcpts=1, msgid=<Pine(_dot_)GSO(_dot_)4(_dot_)58(_dot_)0406101432170(_dot_)2341(_at_)garnet>, proto=ESMTP, daemon=MTA, relay=ms-1.rz.RWTH-Aachen.DE [134.130.3.130]

Jun 10 11:53:43 trei sm-mta[15750]: i5AIrbh9015749: to=<my_address(_at_)some(_dot_)domain(_dot_)tld>, delay=00:00:05, xdelay=00:00:05, mailer=local, pri=94465, dsn=2.0.0, stat=Sent


Hopefully your MTA uses SMTP ids and logs something worthwhile - if not, then this approach won't work, and you really should look at switching to a worthwhile MTA...

That first log line merely comes up in the trivial grep operation - we don't need it (and the revised grep below will in fact not return it in the result set), but it's worth noting the nrcpts= token in there -- where that is >1 is where a conventional use-the-message-itself fixup approach would have significant problems (besides the other issues with that approach).

The to= bit and the mailer=local bit are the significant bits of the second log line. mailer=local means they'd be messages which were locally delivered and thus (assuming procmail is your LDA), subject to your /etc/procmailrc. Deliveries to file aliases and remote users would be excluded, since THEIR copies wouldn't have been affected and they won't show a mailer=local, even though they were received and processed by your mail host.

So, our grep operation might be:

        RECIP_RAW=`grep "$SMTP_ID: to=.* mailer=local" /var/log/maillog`

Thankfully, as only local deliveries should have been affected, you needn't utilize sendmail to effect re-delivery and thus do not need to recover the original envelope sender data (available in the From_ line), which is doable, but just more work.

However, sendmail still works into the equation, as per below.

Even this method will not positively identify each true local recipient specified in the envelope: where someone (usually a spammer, but let's say you have some virtual domains where someone sorts multiple mailboxes from their one mailbox) specifies multiple addresses which resolve to the same local user, only ONE of those addresses will be logged to the maillog (and only ONE message should actually be delivered). This isn't really a problem, since the MTA itself discarded the additional recipient copies, so you're no worse off than what the MTA was doing in the first place.

In the event of multiple recipients, a raw grep would find more headers, but the above grep would still focus on the actual _local_ ones.

Since there is a separate log line for each recipient locally delivered to, a simple sed operation tacked onto the above greppage will clear the cruft:

        | sed -e "s/^\(.*to=<\)\([^>]*\)\(>, .*\)$/\2/"

This nets us raw recipient addresses, one per line (a little more sed scripting, and they'd be on one line, but this isn't necessary at this stage).
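
To see the grep and sed stages working together without touching a real log, here is a small sketch run against made-up log lines. The addresses, the second recipient, and the temp file are assumptions for illustration; the grep and sed expressions are the ones given above.

```shell
SMTP_ID=i5AIrbh9015749

# Hypothetical, abbreviated maillog sample: two local recipients and one
# remote (mailer=esmtp) recipient for the same SMTP id.
maillog=$(mktemp)
cat > "$maillog" <<'EOF'
Jun 10 11:53:38 trei sm-mta[15749]: i5AIrbh9015749: from=<sender@example.org>, size=10158, nrcpts=3, proto=ESMTP
Jun 10 11:53:43 trei sm-mta[15750]: i5AIrbh9015749: to=<alice@example.com>, delay=00:00:05, mailer=local, stat=Sent
Jun 10 11:53:43 trei sm-mta[15750]: i5AIrbh9015749: to=<carol@example.com>, delay=00:00:05, mailer=local, stat=Sent
Jun 10 11:53:44 trei sm-mta[15750]: i5AIrbh9015749: to=<bob@example.net>, delay=00:00:06, mailer=esmtp, stat=Sent
EOF

# Same grep and sed as above, pointed at the sample file.
RECIP_RAW=$(grep "$SMTP_ID: to=.* mailer=local" "$maillog" \
  | sed -e "s/^\(.*to=<\)\([^>]*\)\(>, .*\)$/\2/")

echo "$RECIP_RAW"
rm -f "$maillog"
```

Only the two mailer=local addresses should come out, one per line; the from= line and the remote delivery are skipped by the grep.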

Now, what you have is a list of recipient ADDRESSES. But locally, you want the userids. Invoke sendmail with these, in "address verification mode", which will expand the addresses, shown here with the followup scrubber sed operations:

    RECIP_INTER=`sendmail -bv $RECIP_RAW \
      | sed -e "s/^\(.*deliverable: mailer local, user \)\(.*\)$/\2/" \
      | sed -e :a -e '$!N' -e '$!ba' -e 's-\n-\ -g'`


The resulting variable contains the local usernames to which the message should be delivered.
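
The scrubbing can be tried without invoking sendmail at all. In the sketch below, bv_output stands in for what `sendmail -bv` would print; the two lines are assumed sample output in the "deliverable: mailer local, user" form that the sed expression above expects, with hypothetical users.

```shell
# Stand-in for the output of: sendmail -bv $RECIP_RAW
# (assumed sample lines; the users are hypothetical)
bv_output='alice@example.com... deliverable: mailer local, user alice
carol@example.com... deliverable: mailer local, user carol'

# First sed keeps only the local username from each line; the second
# joins all lines into one space-separated line.
RECIP_INTER=$(printf '%s\n' "$bv_output" \
  | sed -e "s/^\(.*deliverable: mailer local, user \)\(.*\)$/\2/" \
  | sed -e :a -e '$!N' -e '$!ba' -e 's/\n/ /g')

echo "$RECIP_INTER"
```

Note that any line which does NOT match the first sed (an undeliverable address, say) passes through unchanged, so it is worth eyeballing the result on your own system before feeding it to procmail -d.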

There is one issue, and that's dupes - your procmailrc will have shuttled multiple copies of messages - one for each local recipient. You should be able to clear those out using a messageid cache, more or less straight from the procmailex manpage:

        # you fscked up - the bigger the source mailbox, the bigger the cache
        # should be in order to ENSURE you're not duplicating message
        # deliveries.

        :0 Whc: msgid.lock
        | formail -D 100000 msgid.cache

        :0 a:
        duplicates.mbx

There is the remote chance that a message (say, from a mailing list) will have the same messageid and be destined for multiple users handled at your host - say because the recipients have different domains and different backup MX's and your host wasn't immediately reachable when the list message was delivered (or someone is using a mail forwarder, etc), and as a result each of them will have a separate smtpid associated with them. So, we need to formulate a way to cache based on the smtpid rather than (or in conjunction with) the messageid. This works to eliminate dupes caused by the redelivery aspect, but also ensures that each recipient will receive whatever number of copies they would have originally (since we don't actually weed out the dupes they would have received originally).

Of course, if we're using the SMTPID for dupe checking, most all duplicates should be near consecutive to one another, so that larger msgid cache issue is moot: a small one will suffice (plus, our SMTPIDs are MUCH shorter, and a given filesize will accommodate 3-5 x as many cache entries).

You'd of course do this before wasting your time with the rest of the recipient identification operations, sans the SMTPID extraction.
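
The idea of keying the dupe check on the SMTPID can be illustrated with plain shell. This is NOT how formail -D works internally; it is just a stand-in seen-list in a temp file, fed a made-up sequence of ids (the second id is hypothetical), to show that only the first copy for each SMTPID gets processed.

```shell
cache=$(mktemp)
processed=""

# Pretend these are the SMTP ids of successive messages in the mailbox:
# three copies of one message (one per local recipient) and one other.
for id in i5AIrbh9015749 i5AIrbh9015749 i5AIs1xx015760 i5AIrbh9015749; do
  # Exact, fixed-string, whole-line match against the seen-list.
  if grep -qxF "$id" "$cache"; then
    continue                      # duplicate: skip the expensive lookup
  fi
  printf '%s\n' "$id" >> "$cache"
  processed="$processed $id"      # first sighting: go find its recipients
done

echo "processed:$processed"
rm -f "$cache"
```

Doing this check first means the maillog grep and the sendmail invocation only run once per original message rather than once per stored copy.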


Final delivery would be:

        procmail -d recipient recipient recipient


So, let's thread all of that together into one big happy rcfile (look ma, no perl!). I know the following to work, at least for sendmail as installed on my hosts, since I ran a test of it today after I'd written it:

# BEGIN i_fscked_up.rc

# to reprocess a misdelivered mailbox.
# invoke with something like:
#
#  formail -s procmail -m i_fscked_up.rc < thrashedmail.mbx
#
# should be invoked as root (necessary for access to /var/log/maillog as well
# as procmail -d)
#

# you're a schmutz if you need to be running this script in the first place,
# so it'd make sense to log the hell out of what happens when you have to
# run it, in case you fsck up again.
LOGFILE=i_fscked_up.log
VERBOSE=on

# just as it appears in the received: headers on the fscked-up deliveries
MY_MAILHOST="your_mailhost_name"

:0
* $ ^Received:.*by $\MY_MAILHOST.* with E?SMTP id \/[a-z0-9]+
{
  SMTP_ID=$MATCH

  # Unfortunately, formail doesn't process in argument order, and thus
  # msgid can't be rewritten in a single invocation.
  :0 Whc: msgid.lock
  | formail -I"Message-Id: <$SMTP_ID>" | formail -D 8192 msgid.cache

  :0 a:
  duplicates.mbx

  # Now, grep the maillog (or an archived copy of it)
  # for the SMTPID as it pertains to local deliveries of the message.

  RECIP_RAW=`grep "$SMTP_ID: to=.* mailer=local" /var/log/maillog \
    | sed -e "s/^\(.*to=<\)\([^>]*\)\(>, .*\)$/\2/"`

  :0
  * ! RECIP_RAW ?? ^^^^
  {
    # note we're invoking sendmail directly here - if you use a different
    # MTA, things are likely to be radically different (hell, this whole
    # recovery approach might not work for you at all).
    RECIP_INTER=`sendmail -bv $RECIP_RAW \
      | sed -e "s/^\(.*deliverable: mailer local, user \)\(.*\)$/\2/" \
      | sed -e :a -e '$!N' -e '$!ba' -e 's-\n-\ -g'`
  }

  # for testing purposes (I dunno, like the FIRST time you run this script
  # before you're POSITIVE it'll work for your system), you could shuttle
  # this to one mailbox file just so you can see what messages are
  # ultimately identified as deliverable to some user.
  # alternately, prefix the command with echo, and add the 'i' flag
  :0
  * ! RECIP_INTER ?? ^^^^
  | procmail -d $RECIP_INTER
}

# anything not matching an SMTP id, or which failed to result in recipients
# being extracted from the logfile, will end up here, as well as delivery
# failures by the procmail invocation above.  Store in a mailbox for manual
# examination.  Most likely cause for ending up here is messages which are
# older than the maillog reflects.

:0:
unhandled_redelivery.mbx

# END i_fscked_up.rc



FTR, a majordomo listprocessor frontend I wrote makes a backup of messages, and in that, it saves the parameters which were passed to procmail. This allows for simple recovery from list problems (all lists are archived to the same backup file), by formail splitting the edited backup against a simple recovery script. This alleviates the problems associated with screwups, and doesn't require special access to logfiles and the like. It is also terribly faster than the above script will end up being when run against a large spool of messages...


NOW, nobody can say there's not a writeup of the proper way to recover from an /etc/procmailrc filing fsckup.

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
