Re: duplicates and delivery problems

Michael Helm <helm(_at_)fionn(_dot_)es(_dot_)net> wrote:

Ugly maybe expensive idea -- hash the body (maybe do something to
delete leading & trailing "junk", such as added mime wrappers).
Use SHA or MD5.  Save with an associated date (possibly in a
db scheme, for extra credit).   Check hash against a list of previously
seen hashes, if matches, count as a duplicate, otherwise add
hash to list.  Periodically flush hash list.

Maybe formail could be modified to do this & manage it like it can
manage the message-id duplicates.


The hashd program I wrote will do that. (It won't calculate a hash;
you do that any way you want.) You can get it from:
        <URL:ftp://ftp.netusa.net/users/eli/proc-util.tgz>
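
Since hashd leaves the hashing step to you, here is a minimal Python
sketch of the body hash suggested above. The function name body_hash,
the use of SHA-1, and the choice to strip only surrounding whitespace
are assumptions made for illustration; none of this is part of hashd
or formail.

```python
import hashlib

def body_hash(message: str) -> str:
    """Return a SHA-1 hex digest of the message body only."""
    # Headers end at the first blank line; everything after it is the body.
    _, _, body = message.partition("\n\n")
    # Trimming whitespace is a crude stand-in for deleting leading and
    # trailing "junk"; real MIME wrappers would need more work.
    return hashlib.sha1(body.strip().encode("utf-8")).hexdigest()
```

The resulting 40 character digest could then be handed to hashd as the
hashstring. It fits comfortably within the default -l of 128.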

Below is the README. People familiar with it will note the presence
of some new options, in particular the -x for time-stamped expiring
databases. That has not been tested as thoroughly as I would like,
but I am pretty sure it works. (In particular I am concerned about
resizing the database, so if you use a -s size larger than you
expect it to grow, there should be no problems.)

Elijah
------

                                hashd

Maintains a database of hashes and exits 1 or 0 based on the
presence or absence of a presented hash being in the database.
Intended as a general purpose replacement for 'formail -D'.

by Benjamin Elijah Griffin <eli(_at_)netusa(_dot_)net>  *  22 July 97

* Summary of usage:

    hashd dbfile [options or "--"] hashstring

Options:

    -!    invert exit code (does not affect exit 2 -- error condition)
    -l N  consider length N of hashstring significant (default 128)
    -t    if hashstring found in database, move it to top of list
    -d    don't add new hashes to database (useful with -t)
    -r    database is for read only, do not update with new hashstring
    -s N  use a database of N entries (default 128)
    -x N  use a database with entries that expire after N units of time
              default N is 1, default unit is weeks (units: smhdwy)
    -0    use null-terminated (formail -D style) database
    -v    be verbose
    --    end of options

Notes: hashd will try to fit the whole (length * size) db in memory.
       It is not a good idea to change the -l value for a db. Ugly
       things happen when hashstring has any \n's in it unless you are
       using a null-terminated (-0) database. The null-terminated
       format can read formail cache files and write files readable
       by formail, but the contents will be (logically) reordered.


* Usage in more detail:

hashd wants a database file (format described below), a non-optional list
of options (though '--' is accepted for 'no options') and a hashstring.
The name of the database file will always be the first argument and the
hashstring will always be the last argument. Please be mindful of quoting
conventions needed to prevent the hashstring from being interpreted as
multiple arguments by the shell.

hashd will load the database into memory and then try to find hashstring
in it. The hash must match exactly from beginning to end for a match to be
considered successful; this includes leading and trailing whitespace. In
normal circumstances hashd will exit with a code of 0 if a match was found
and a code of 1 if none was found. The -! option will swap these. Various
error conditions before loading the database will cause exits with a code
of 2. After the database has been loaded any errors will produce status
messages but will not affect the exit code. (This may change in future
releases.)

The -l and -s options control how the database is represented in memory
and thus how it will be written back out. Changing the -s size of an
existing database is relatively harmless. It will grow or be truncated
as needed. Changing the -l size will cause more problems. Long lines
that had been truncated in the past cannot be restored (for making it
larger) and long lines in the database will be truncated (for making it
smaller). Expiring databases interpret -s and -l slightly differently,
see below.
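
The asymmetry of changing -l can be seen in a small Python model. This
is only a sketch of the truncation behaviour described above, not
hashd's own code, and the helper name resize_length is made up for
illustration.

```python
def resize_length(entries, length):
    """Model rewriting a database with a new -l value: keep only the
    first `length` characters of every entry."""
    return [entry[:length] for entry in entries]

original = ["alpha-0123456789", "beta-0123456789"]

# Shrinking -l cuts long lines down.
shrunk = resize_length(original, 8)

# Growing -l again cannot bring back what was cut off.
regrown = resize_length(shrunk, 128)
```

After the round trip the entries are still the eight character
versions, which is why picking one -l value per database and sticking
with it is the safe course.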

hashd will usually try to write out the database again. Exceptions occur
when the database is read-only (-r), when a match is found but cycle to
top (-t) is not selected, or when a match is not found and don't add new
(-d) is selected. If the hashstring was not found, it will be placed on
the top of the list (and possibly push something off the bottom). If the
hashstring was found, the database will be rewritten with that line at
the top if -t is selected. Otherwise there is no need to change the
database file and it will quit without rewriting.
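
The rewrite rules above can be modelled in a few lines of Python. This
is a sketch of the documented behaviour only; the function name, the
list representation (top of the list at index 0), and treating -l as a
simple truncation of the search key are assumptions, not a description
of the C source.

```python
def lookup_and_update(db, hashstring, size=128, length=128,
                      to_top=False, no_add=False, read_only=False):
    """Return (exit_code, new_db, rewritten) for one hashd-style lookup.

    db is a list of entries with the top of the list at index 0.
    """
    key = hashstring[:length]          # only -l characters are significant
    found = key in db
    exit_code = 0 if found else 1      # 0 = match, 1 = no match

    if read_only:                      # -r: never write
        return exit_code, db, False
    if found and not to_top:           # match without -t: nothing to change
        return exit_code, db, False
    if not found and no_add:           # miss with -d: nothing to add
        return exit_code, db, False

    # Either a match being cycled to the top (-t) or a new entry.
    rest = [entry for entry in db if entry != key]
    new_db = ([key] + rest)[:size]     # the oldest entry may fall off the bottom
    return exit_code, new_db, True
```

For example, looking up "d" in ["a", "b", "c"] with size=3 reports no
match, puts "d" on top, and pushes "c" off the bottom.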

When the verbose option is specified, most diagnostics will be written
to stderr, but a few very verbose ones will be written to stdout. This
allows you to redirect stdout to /dev/null to get a less verbose
diagnostic.

Unless -0 is used, the database file is a regular file with one
hashstring per line. If you know what you are doing, you can create or
edit these files in your favorite text editor. The one hashstring per line
means that hashes themselves may not span multiple lines. Also if the
hashes contain binary data, so will the database. (This might cause
problems editing it in your text editor.)
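
To see why a newline inside a hashstring is a problem for the
line-based format, consider this Python sketch. The helper names
write_db and read_db are hypothetical; they model the file format as
described here, not hashd's actual I/O code.

```python
def write_db(entries):
    """Serialise entries in the one-hashstring-per-line format."""
    return "".join(entry + "\n" for entry in entries)

def read_db(text):
    """Parse the one-hashstring-per-line format back into a list."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()                    # drop the piece after the final newline
    return lines

clean = read_db(write_db(["abc", "def"]))

# An embedded newline turns one entry into two on the next read.
broken = read_db(write_db(["abc", "de\nf"]))
```

The second database comes back with three entries instead of two, so
the original hashstring can never match again.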

When -0 is used, the database is superficially similar to the cache files
used by "formail -D". Hashes are null terminated and the file ends with
a double null. This is the format of a formail cache prior to the file
"wrapping" around. File wraps occur in formail after it has reached the
size limit specified for it. When reading and writing null-terminated
databases, hashd makes no attempt to preserve the effects of this wrap
around format.
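
A Python sketch of the null-terminated layout, as far as it is
described above, might look like the following. The function names are
made up, the layout is taken only from this README and has not been
checked against the formail source, and wrapped formail caches are
deliberately not handled.

```python
def dump_null_db(entries):
    """Each hash is followed by one NUL; one more NUL ends the file."""
    return b"".join(entry + b"\0" for entry in entries) + b"\0"

def load_null_db(data):
    """Read hashes up to the first empty one (the double NUL)."""
    entries = []
    for chunk in data.split(b"\0"):
        if chunk == b"":
            break                      # hit the double-NUL end marker
        entries.append(chunk)
    return entries
```

Because the terminator is a null rather than a newline, an entry such
as b"a\nb" survives a round trip through this format unchanged.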

The database can also be configured to be expiring. In this mode there is
a timestamp entry associated with each hash. Any hash older than the
expiration time will be ignored and -- if the database is writable --
deleted. With an expiring database there is no fixed bound on the number
of hashes in the database. The -s option will be used as the block size
by which the database grows. Unless you know you will be dealing with a
very large database there is no reason to change it. The -l size is still
used, but ten is added to the value internally to allow room for the
timestamp. Timestamps are stored at the front of the hash as a decimal
number giving the time the entry was created, in seconds since the epoch.
This means it is subject to the year 2038 problem on machines with 32-bit
integer types.
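
The timestamped entries can be pictured with the Python sketch below.
The fixed ten digit, zero padded prefix with no separator is an
assumption inferred from "ten is added to the value"; the real on-disk
layout may differ, and all three function names are hypothetical.

```python
import time

STAMP_WIDTH = 10  # assumed: matches the ten characters added to -l

def make_entry(hashstring, now=None):
    """Prefix the hash with its creation time in seconds since the epoch."""
    now = int(time.time()) if now is None else int(now)
    return f"{now:0{STAMP_WIDTH}d}{hashstring}"

def split_entry(entry):
    """Return (timestamp, hashstring) for an entry built by make_entry."""
    return int(entry[:STAMP_WIDTH]), entry[STAMP_WIDTH:]

def live_entries(entries, max_age, now):
    """Keep only the entries that are at most max_age seconds old."""
    return [e for e in entries if now - split_entry(e)[0] <= max_age]
```

Ten digits are enough for any signed 32-bit time value, since the
largest one, 2147483647, is itself ten digits long.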

On the command line the expiration is specified as a number with a trailing
unit of time. The default unit is the week. Units are the first initial of
"seconds," "minutes," "hours," "days," "weeks," or "years." The expiration
must be at least one second. A year is treated as 365 days.
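
As a worked example of the unit rules, this Python sketch converts an
expiration argument to seconds. It is not hashd's parser; the function
name parse_expiry is made up, and it assumes the number is always
present (how hashd treats a bare unit letter is not covered here).

```python
import re

UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 60 * 60,
    "d": 24 * 60 * 60,
    "w": 7 * 24 * 60 * 60,
    "y": 365 * 24 * 60 * 60,   # a year is treated as 365 days
}

def parse_expiry(spec):
    """Convert a string such as "10d" into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy]?)", spec)
    if match is None:
        raise ValueError(f"bad expiration: {spec!r}")
    number, unit = match.groups()
    seconds = int(number) * UNIT_SECONDS[unit or "w"]   # default unit: weeks
    if seconds < 1:
        raise ValueError("expiration must be at least one second")
    return seconds
```

So "10d" works out to 864000 seconds, and a bare "1" means one week,
or 604800 seconds.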

* Some notes on compiling hashd:

My code is mostly neat and well commented, but it has some ugly bits
that will annoy C compilers that like to check types. These will,
hopefully, only produce warnings. Besides that there are a couple of
things used which are not universally implemented the same way. HP-UX
(if I remember correctly) does not have an "extern int errno;" in its
<errno.h>, so you may have to uncomment the line I have that statement
on for that system and similar ones. As far as I know, all systems
have at least one of bzero() and memset(), but not all have a particular
one of the two. There are two places where I have code for both, and
one of them is ifdef'ed out; use whichever you need by changing the macro
definitions in the Makefile.

* Some examples of using hashd:

Mostly I wrote this to be used in procmailrc files for dealing with
duplicate mail messages. The idea is some program generates a hash or
metric of a message's contents and then hashd will detect if it has
been seen before. The formail program from the procmail package does
this for message-IDs (and in a very limited fashion for email addresses)
already. 

I wanted something that would work with a metric made of the size of the
message body, the From address, and the subject, to deal with repeated
spams. Here's how that can be implemented in procmail 3.10 and up with
hashd.

   # Deal with duplicates, as determined by a metric of From: and Subject:
   # contents plus message size.

   # hashd will be modifying the database but does not use any locking
   # conventions, so use procmail's locking mechanism.
   LOCKFILE=mymtrc.lock

   # These will have leading whitespace.
   From    = `formail -x From:`
   Subject = `formail -x Subject:`

   # "B"ody only. Headers vary too much. Using ":0B" doesn't work, hence
   # this weird syntax. It is so easy to hate procmail for stuff like this.
   # Adding "* 1^0 ^^" would fix an off-by-one size problem, but that
   # does not bother me for this.
   :0
   * B ?? 1^1 > 1
   { }
   Size     = $=

   # Order is selected to minimize long line truncation problems.
   Metric="$Size$From$Subject"

   # -s 300  size of dbfile
   # -l 200  length of line significant
   # -t      cycle matches to top
   :0
   * ? hashd mymtrc.cache -t -s 300 -l 200 "$Metric"
   { 
      # We found a dupe, log that fact. (Assumes LOGFILE set earlier
      # in the procmailrc.)
      LOG = "size-from-subject duplicate: "

      # Now write it out to a special mailbox
      :0:
      duped-mail
   }

   # Free the lock
   LOCKFILE
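
The reasoning behind the ordering of the metric can be shown with a
short Python sketch. The function build_metric is hypothetical, and it
simply counts bytes in the body, which only approximates the size the
recipe above obtains from procmail's scoring.

```python
def build_metric(body, from_header, subject_header, length=200):
    """Return the size+From+Subject string, cut to `length` characters."""
    size = str(len(body.encode("utf-8")))
    # Size first, subject last: if the line is too long, only the tail
    # of the subject is lost.
    return (size + from_header + subject_header)[:length]

# Header values keep their leading whitespace, as with "formail -x".
short = build_metric("hello\n", " spammer@example.com", " MAKE MONEY FAST")
long_ = build_metric("hello\n", " spammer@example.com", " x" * 300)
```

With a very long subject the result is still cut at 200 characters, but
the size and the From address at the front remain intact.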

Here's another example as a replacement for fgreping a list of friends.

   # File of email addresses of friends, one per line.
   Friends = friend.list

   # Extract address from the "Return-Path:" header (this is added by the
   # receiving mailer on some systems; on others use the "From " line).
   :0
   * ^Return-Path: *<\/[^>]*
   { 
      FromAddr = $MATCH

      # -r      read only database
      # -s 500  any size larger than the list
      # -l 40   any size larger than the longest email address
      :0
      * ? hashd $Friends -r -s 500 -l 40 $FromAddr
      {
         # From a friend, log that fact. (Assumes LOGFILE set earlier
         # in the procmailrc.)
         LOG = "found in friend list: "

         # Now write it out to friends mailbox
         :0:
         friends-mail
      }
   }

Here's a similar example as a replacement for fgreping a list of bozos.

   # File of email addresses of bozos, one per line.
   Bozos = bozos.list

   # Extract address from the "Return-Path:" header (this is added by the
   # receiving mailer on some systems; on others use the "From " line).
   :0
   * ^Return-Path: *<\/[^>]*
   { 
      FromAddr = $MATCH

      # -r      read only database
      # -!      invert exit code
      # -s 500  any size larger than the list
      # -l 40   any size larger than the longest email address
      :0
      * ? hashd $Bozos -r! -s 500 -l 40 $FromAddr
      {
         # Not from a bozo, log that fact. (Assumes LOGFILE set earlier
         # in the procmailrc.)
         LOG = "was not found in bozos list: "

         # Now write it out to friends mailbox
         :0:
         friends-mail
      }
   }

Here is a way to send vacation notices once to each person who
sends you mail during your 10 day break.

   From = `formail -rz -x To:`
   LOCKFILE = vacation.lock

   # -!      invert exit code
   # -l 60   allow for some long address to not get truncated
   # -x 10d  keep stuff around for ten days
   :0 Wi
   * ? hashd vacation.cache -! -l 60 -x 10d "$From"
   {
      # Have not sent a message to this person yet. (Assumes LOGFILE set
      # earlier in the procmailrc.)
      LOG = "notifying about my vacation: "

      # Now send a reply off, keeping a copy for me.
      :0hc
      | ( formail -r ; echo ; echo "I will be away from July 11th to 20th" ;\
          echo "and will not be able to respond to mail" ; \
          echo "during that period. Sorry." ) | $SENDMAIL -t
   }

Here is a killfile method that lets you easily see what stuff is not
being used anymore. Any time something is matched, that gets moved
to the top, so old and useless stuff will fall down to the bottom
(end) of the file.

   From = `formail -rz -x To:`
   LOCKFILE = killfile.lock

   # -t      cycle matches to top
   # -l 60   allow for some long address to not get truncated
   # -d      don't add non matches
   :0 Wi:
   * ? hashd killfile.cache -t -l 60 -d "$From"
   killed-mail

These two recipes are roughly equivalent. In the unlikely event that
you got a whole bunch of messages with 500 byte Message-IDs, formail
would fit about sixteen of them into its cache file while hashd
would truncate them all to 40 bytes while keeping its database at
200 entries.

   # Recipe one (from procmailex(5))
      :0 Wh: msgid.lock
      | formail -D 8192 msgid.cache

   # Recipe two
      :0 Wi: msgid.lock
      * ^Message-id:\/.*
      | hashd msgid.cache -0 -s 200 -l 40 "$MATCH"
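
The arithmetic behind that comparison, written out as a small Python
sketch. The constant names are made up, and the hashd figure ignores
the per-entry terminators.

```python
FORMAIL_CACHE_BYTES = 8192   # from "formail -D 8192 msgid.cache"
MESSAGE_ID_BYTES = 500       # the unlikely long Message-ID
HASHD_ENTRIES = 200          # -s 200
HASHD_LENGTH = 40            # -l 40

# Each cached ID costs its own length plus one terminating null.
formail_ids = FORMAIL_CACHE_BYTES // (MESSAGE_ID_BYTES + 1)

# hashd keeps a fixed number of entries, each cut to -l characters.
hashd_bytes = HASHD_ENTRIES * HASHD_LENGTH
```

That gives sixteen whole Message-IDs for formail, against 200 truncated
ones in about 8000 bytes for hashd.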