procmail

Re: On discouraging direct replies

1997-10-27 12:12:08
Any good aficionado of procmail should have a duplicate-removing filter
in their personal filter toolkit.

The problem then becomes not losing mail from those stupid mailers that
don't generate unique message IDs, which I have recently been complaining
about on this list. I may switch to using only content-based duplicate checks.

Check out this recipe file, from my procmail library.

Just set "dupcheck_use_md5" (any non-empty value will do) to turn on the
content-based check.
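
For anyone who wants to see the idea outside of procmail first, here is a rough shell sketch of a content-based check along the same lines as the recipe's MD5 path. It is only an illustration under a few assumptions: the function name body_is_duplicate and its cache-file argument are made up here, it matches whole lines with "grep -F -x" rather than the recipe's "fgrep -s", and it assumes an md5sum-style program is on the PATH (override it with MD5SUM, as in the recipe).

```shell
#!/bin/sh
# Hypothetical helper (not part of dupcheck.rc): reads one message body on
# stdin.  Returns 0 if its checksum is already in the cache (a duplicate),
# 1 if it is new (and records it), 2 if the checksum program failed.
body_is_duplicate() {
    cache=${1:-.md5sums}        # assumed default, mirroring $md5sums

    # Turn newlines and tabs into spaces and squeeze runs of whitespace,
    # so re-wrapped copies of the same text produce the same checksum.
    sum=$(tr -s '\012\011 ' '   ' | ${MD5SUM:-md5sum}) || return 2

    # Make sure the cache exists before searching it.
    [ -f "$cache" ] || : >"$cache"

    if grep -F -q -x -- "$sum" "$cache"; then
        return 0                # seen before: caller may drop the mail
    fi
    printf '%s\n' "$sum" >>"$cache"
    return 1                    # first sighting
}
```

Feeding the same body through it twice, even with different line wrapping, should report "new" the first time and "duplicate" the second. Unlike the recipe, this sketch does no locking, so it is not safe for concurrent deliveries as written.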

# dupcheck.rc
#
#    Copyright (C) 1997  Alan K. Stebbens <aks(_at_)sgi(_dot_)com>
#
#    This program is free software; you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation; either version 2 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program; if not, write to the Free Software
#    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
# Usage:
#
#    INCLUDERC=dupcheck.rc
#
# If the current mail has a "Message-Id:" header,  run the
# mail through "formail -D", causing duplicate messages to
# be dropped.
#
# If the mail does not have a "Message-Id", or if the variable
# "dupcheck_use_md5" is set, then run the body through MD5SUM (defaults
# to "md5sum"), causing duplicate messages to be dropped.  However,
# even if "dupcheck_use_md5" is set, the "formail -D" filter will
# still be applied if a Message-Id: exists.
#
# Currently, the only way to use *only* an "md5sum" duplicate check
# is to remove the Message-Id: before invoking this recipe file.
# Something like this:
#
#       :0fh            # remove the Message-Id:
#       | formail -IMessage-Id:
#       INCLUDERC=dupcheck.rc
#
# In my opinion, however, an MD5 checksum should be used in addition to
# the Message-Id:, not instead of it.  The MD5 checksums can be used to
# detect and avoid redistributed or resent messages.
#
# The variable MD5SUM can be set to the program to perform the checksum
# on the message body.  By default, it is set to "md5sum".
#
# This recipe has been enhanced with a "fail-safe" algorithm to avoid
# losing mail which has been "soft-failed" by some later part of the
# user's recipe file.  This algorithm is only applied if the variable
# "dupcheck_failsafe" has been set to the number 1 or 2.
#
# There are two methods available to accomplish this, selected by
# the variable "dupcheck_failsafe".
#
# 1.  the simple way: remove the checksum files when procmail exits with
#     an exitcode of 75.  This is done by using a TRAP command.
#
#     The disadvantage is that the whole checksum file is discarded
#     whenever any mail "soft fails", so later duplicates of mail that
#     was received before the soft-failure will not be detected.
#
#     An advantage of this method is its simplicity.  The worst case is
#     that a few duplicates may not be detected.
#
# 2. the hard way: using an additional "pending" log file, when a new
#    mail arrives and passes the checksum filters the first time, add it
#    to the "pending" file.  Using a TRAP command, remove the processed
#    mail from the "pending" file, unless the exitcode was 75
#    (EX_TEMPFAIL).  If the mail fails the checksum commands, but is
#    still in the pending file, then do NOT drop the mail as a
#    duplicate, and leave it in the pending file.  Eventually, when the
#    mail is finally delivered (by any means), it will be removed from
#    the "pending" file.
#
#    The disadvantage with this method is its complexity: an additional
#    "pending" file is needed, and there is a small performance hit for
#    each mail, in order to maintain the pending file.
#
#    The advantage of this method is its completeness: all duplicates
#    will still be detected and dropped.
#
# Both methods require the use of the TRAP command; they set additional
# commands into the TRAP variable.  If the TRAP command has already been
# set, the new commands are added to the list of commands.
#
# *** Warning: if the user sets the TRAP command in a later recipe, care
# should be taken to avoid losing the commands installed by this recipe.
# The way to set TRAP with a new command, without losing any existing
# ones, is:
#
#   TRAP="${TRAP:+${TRAP}; }new-command args ..."
#
# All the default filenames used by this recipe are relative to the
# current directory, $MAILDIR.

msgids=.msgids                  # where we keep the message id's
md5sums=.md5sums                # where the md5 checksums go
pendingids=.pendingids          # pending mail cache

dupcheck_failsafe=${dupcheck_failsafe:-2}
                                # set to 1 or 2 to select a failsafe
                                # method.  Defaults to 2 if unset or
                                # empty.  Any other value means *no
                                # failsafe* method is applied.

OLDCOMSAT=${COMSAT:-off}        # Don't tell COMSAT anything
COMSAT=off

# To keep the complexity of this recipe down, we'll separate the
# failsafe methods

:0                              # methods 1 or 0 (none)
* !dupcheck_failsafe ?? ^^2^^
{
  :0 Wh: $msgids.lock           # is there a Message-Id:?
  * ^Message-Id: *\/[^ ].*
  | formail -D 16384 $msgids
                                # mail is not a duplicate..
  :0 e                          # passed formail's check; failsafe it?
  * dupcheck_failsafe ?? 1
  { TRAP="${TRAP:+${TRAP}; }test \$EXITCODE -eq 75 && rm -f $msgids" }

  :0                            # derive a checksum if needed or requested
  * 1^0 !^Message-Id:
  * 1^0 dupcheck_use_md5 ?? .
  { 
    # Compute an MD5 checksum of the body: convert newlines and tabs
    # to spaces, squeeze runs of whitespace, then checksum the result.
    :0 b                        # scan only the body
    MD5CKSUM=|tr -s '\012\011 ' '   ' | ${MD5SUM:-md5sum}

    LOCKFILE=$md5sums.lock      # lock $md5sums
    :0 aWhi                     # see if this is a duplicate message 
    |fgrep -s "$MD5CKSUM" $md5sums # if fgrep succeeds, we've tossed the mail
                                # Hurray! The mail is not a duplicate!
    :0 echi                     # add the new checksum to the file
    |echo "$MD5CKSUM" >>$md5sums
    LOCKFILE                    # unlock $md5sums

    :0                          # see if the msg already has a header
    * ^X-MD5-Checksum: \/[^ ].*
    * $? test "$MD5CKSUM" != "$MATCH"
    { insert_opt='i' }          # save the old headers on mismatch
    :0 fh                       # insert (or replace) current checksum
    |formail -${insert_opt:-'I'}"X-MD5-Checksum: $MD5CKSUM"
    :0                          # any failsafe?
    * dupcheck_failsafe ?? 1
    { TRAP="${TRAP:+${TRAP}; }test \$EXITCODE -eq 75 && rm -f $md5sums" }
  }
}
:0 E                            # fail-safe method 2: 
* dupcheck_failsafe ?? ^^2^^
{ # With method 2, we must maintain a file of "pending" mail using this
  # recipe and a command in the TRAP variable.

  :0 chi:$pendingids.lock       # ensure that the pending cache is writable
  * !?test -w $pendingids
  | rm -f $pendingids ; touch $pendingids

  pending                       # clear var
  LOCKFILE=$pendingids.lock     # lock $pendingids
  # If there's a Message-Id and the mail is not pending
  :0                            # is there a message-id?
  * ^Message-Id: *\/[^ ].*
  { MSGID=$MATCH                # save for later
    :0                          # see if pending
    * $!?fgrep -s '$MATCH' $pendingids
    { :0 Wh:$msgids.lock        # see if the mail is a duplicate
      |formail -D 16384 $msgids

      :0 chi                    # it's not; update the pending cache
      |echo "$MSGID" >>$pendingids

    }
    :0 E                        # it is a pending mail 
    { pending=y }               #  mark it so
    # Update the TRAP command list to remove the msgid from the
    # pending cache on normal exits
    TRAP="${TRAP:+${TRAP}; }\
          if test \$EXITCODE -ne 75 ; then \
            fgrep -v '$MSGID' $pendingids >$pendingids.new ; \
            mv $pendingids.new $pendingids ; \
          fi"
  }
  LOCKFILE                      # unlock $pendingids

  # if not already pending, derive a checksum if needed or requested
  :0
  * 1^0 !^Message-Id:
  * 1^0 dupcheck_use_md5 ?? .
  { :0 b                        # checksum the body only
    # Squeeze runs of whitespace and then checksum the result
    MD5CKSUM=|tr -s '\012\011 ' '   ' | ${MD5SUM:-md5sum}
    LOCKFILE=$pendingids.lock # check the pending cache
    :0                          # make sure it's not pending
    * ! pending ?? y    
    * $!?fgrep -s '$MD5CKSUM' $pendingids
    { :0 Whi:$md5sums.lock      # see if mail is an MD5 duplicate
      |fgrep -s "$MD5CKSUM" $md5sums
                                # Hurray! the mail is not a duplicate
      :0 chi:                   # add the new checksum to the file
      |echo "$MD5CKSUM" >>$md5sums
      :0 achi                   # update the pending cache
      |echo "$MD5CKSUM" >>$pendingids
    }
    LOCKFILE                    # maybe pending now
    # Either we've updated the pending file, or the checksum was 
    # already in it.  So, now we update the TRAP command list to
    # remove the checksum on normal exits.
    TRAP="${TRAP:+${TRAP}; }\
          if test \$EXITCODE -ne 75 ; then \
            fgrep -v '$MD5CKSUM' $pendingids >$pendingids.new ; \
            mv $pendingids.new $pendingids ; \
          fi"
    :0                          # see if the msg already has a header
    * ^X-MD5-Checksum: \/[^ ].*
    * $? test "$MD5CKSUM" != "$MATCH"
    { insert_opt='i' }          # save old headers on mismatch
    :0 fh                       # in any case, set the new checksum
    |formail -${insert_opt:-'I'}"X-MD5-Checksum: $MD5CKSUM"
  }                             # end md5 check
}
COMSAT=$OLDCOMSAT               # set COMSAT back to original value
OLDCOMSAT
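
For what it's worth, the bookkeeping behind failsafe method 2 can be sketched in plain shell, which may make the recipe easier to follow. This is only a rough model, not what the recipe literally runs: the function names pending_add, pending_has and pending_finish are hypothetical, the PENDING variable stands in for $pendingids, and whole-line matching ("grep -F -x") is used in place of the recipe's substring fgrep.

```shell
#!/bin/sh
# Hypothetical model of the "pending" cache used by failsafe method 2.
PENDING=${PENDING:-.pendingids}     # stands in for $pendingids

# Record a key (a Message-Id or a checksum) when its mail is first seen.
pending_add() {
    printf '%s\n' "$1" >>"$PENDING"
}

# Succeed if the key is still waiting for a successful delivery.  A mail
# whose key is pending must NOT be dropped as a duplicate.
pending_has() {
    [ -f "$PENDING" ] && grep -F -q -x -- "$1" "$PENDING"
}

# What the TRAP commands amount to at exit: on EX_TEMPFAIL (75) leave the
# key in place so the retry is let through; otherwise remove it.
pending_finish() {
    key=$1
    exitcode=$2
    if [ "$exitcode" -eq 75 ]; then
        return 0
    fi
    [ -f "$PENDING" ] || return 0
    # grep -v exits 1 when nothing is left to print; that is not an error.
    grep -F -v -x -- "$key" "$PENDING" >"$PENDING.new" || [ $? -eq 1 ] || return 2
    mv "$PENDING.new" "$PENDING"
}
```

So a mail that soft-fails stays in the pending cache and is let through on the retry, while a mail that is delivered normally is removed and any later copy is caught by the ordinary duplicate checks. The recipe does the same thing under $pendingids.lock; the sketch leaves locking out.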
