
[Nmh-workers] enhancement to mhfinddup

2008-09-09 13:11:35

I'm paranoid...at least I've got what I believe to be a healthy paranoia about
accidentally deleting mail.

Recently, I ended up in a situation (due to a vacation, accessing mail remotely
from different mail clients, using the "leave on server" option to fetchmail,
etc.) where I've got about 4000 duplicated mail messages.

I started to use Bill Wohler's mhfinddup script[1] to delete the duplicates,
then my caution took hold. The Message-Id "should" be "unique", but since it
may be client-generated, collisions are possible, so I decided to add another
test for duplicates to mhfinddup. Using formail(1) and perl's Digest::MD5
module, the script now compares MD5 checksums of the message body (plus the
From: and Subject: headers) whenever the Message-Ids are the same. The rest of
the headers are left out of the checksum to ensure that messages which vary
only by date stamps in the headers are still treated as duplicates.
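
For illustration only, the same check can be sketched in pure Perl without
formail(1). This is not the code in the patch: body_checksum is a hypothetical
helper, the sample messages and addresses are made up, and the header parsing
is a simplification that assumes a plain RFC 822-style message held in one
string.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper (not part of mhfinddup): MD5 hex digest of the
# From: and Subject: headers plus the body of one message string.
sub body_checksum {
    my ($message) = @_;

    # Headers end at the first blank line; the rest is the body.
    my ($head, $body) = split /\r?\n\r?\n/, $message, 2;
    $head = '' unless defined $head;
    $body = '' unless defined $body;

    # Unfold continuation lines so each header sits on one line.
    $head =~ s/\r?\n[ \t]+/ /g;
    my @lines = split /\r?\n/, $head;

    # Keep only From: and Subject:, in a fixed order.
    my @kept;
    for my $name (qw(From Subject)) {
        push @kept, grep { /^\Q$name\E:/i } @lines;
    }

    return md5_hex(join("\n", @kept, '', $body));
}

# Two copies of one message that differ only in Received: and Date:.
my $copy1 = join "\n",
    'Received: from hostA',
    'Date: Tue, 9 Sep 2008 13:11:35 -0400',
    'From: mark@example.org',
    'Subject: test',
    'Message-Id: <1@example.org>',
    '',
    "Hello\n";
my $copy2 = join "\n",
    'Received: from hostB',
    'Date: Wed, 10 Sep 2008 08:00:00 -0400',
    'From: mark@example.org',
    'Subject: test',
    'Message-Id: <1@example.org>',
    '',
    "Hello\n";

# Same Message-Id as above, but a different body (a "collision").
my $other = join "\n",
    'From: mark@example.org',
    'Subject: test',
    'Message-Id: <1@example.org>',
    '',
    "Goodbye\n";

print body_checksum($copy1) eq body_checksum($copy2)
    ? "same\n" : "different\n";    # prints "same"
print body_checksum($copy1) eq body_checksum($other)
    ? "same\n" : "different\n";    # prints "different"
```

Only the first pair would be handed to handle_dup; the second pair is the
case the patch reports on STDERR instead of deleting.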


Below you can find the diff (suitable for feeding to patch(1)) against 
mhfinddup 1.2.

Thanks to all who help keep [""|n|ex]mh alive and well.

Mark

        [1] http://www.faqs.org/faqs/mail/mh-faq/part1/section-125.html

------------------ diff to add MD5 checksum test to mhfinddup --------------
***************
*** 138,150 ****

  # Packages and pragmas.
  use Getopt::Long;

  use strict;

  # Constants.
  my $cmd;                                # name by which command called
  ($cmd = $0) =~ s|^\./||;                # ...minus the leading ./
! my $ver = '$Revision: 1.1 $';         # program version with CVS noise
  $ver =~ s/\$//g;                        # strip dollar signs
  $ver =~ s/Revision://;                  # strip CVS keyword
  $ver =~ s/\s//g;                        # strip whitespace
--- 138,151 ----

  # Packages and pragmas.
  use Getopt::Long;
+ use Digest::MD5 qw(md5_hex);

  use strict;

  # Constants.
  my $cmd;                                # name by which command called
  ($cmd = $0) =~ s|^\./||;                # ...minus the leading ./
! my $ver = '$Revision: 1.2 $';         # program version with CVS noise
  $ver =~ s/\$//g;                        # strip dollar signs
  $ver =~ s/Revision://;                  # strip CVS keyword
  $ver =~ s/\s//g;                        # strip whitespace
***************
*** 195,201 ****
                $msgs{$msgid} =~ m|^\+(.*)/(\d+)$|;
                my($f, $m) = ($1, $2);
                if ($folder eq $f || $no_same_folder) {
!                   handle_dup($f, $m, $folder, $msg);
                }
            } else {
                $msgs{$msgid} = "+$folder/$msg";
--- 196,226 ----
                $msgs{$msgid} =~ m|^\+(.*)/(\d+)$|;
                my($f, $m) = ($1, $2);
                if ($folder eq $f || $no_same_folder) {
!                       # it looks like we've got a duplicate...let's be sure by doing a MD5
!                       # checksum of the message body + Subject + From headers
!                       ######## Get the checksum from message 1
!                       my $folderpath=`mhpath +$f`;
!                       chomp($folderpath);
!                       open(FORMAIL,"formail -k -X From: -X Subject: < $folderpath/$m|") or die "Could not open pipe from \"formail -k -X From: -X Subject: < $folderpath/$m\": $!";
!                       my @msgbody=<FORMAIL>;
!                       close(FORMAIL) or die "Could not close pipe from \"formail -k -X From: -X Subject: < $folderpath/$m\": $!";
!                       my $sum1=md5_hex(@msgbody);
!
!                       ######## Get the checksum from message 2
!                       $folderpath=`mhpath +$folder`;
!                       chomp($folderpath);
!                       open(FORMAIL,"formail -k -X From: -X Subject: < $folderpath/$msg|") or die "Could not open pipe from \"formail -k -X From: -X Subject: < $folderpath/$msg\": $!";
!                       @msgbody=<FORMAIL>;
!                       close(FORMAIL) or die "Could not close pipe from \"formail -k -X From: -X Subject: < $folderpath/$msg\": $!";
!                       my $sum2=md5_hex(@msgbody);
!                       if ( $sum1 eq $sum2 )
!                       {
!                       handle_dup($f, $m, $folder, $msg);
!                       }
!                       else
!                       {
!                               printf STDERR "That's odd...messages \"$f/$m\" and \"$folder/$msg\" have the same Message-Id but different checksums\n";
!                       }
                }
            } else {
                $msgs{$msgid} = "+$folder/$msg";
----------------------------------------------------------------------------


-----
Mark Bergman    Biker, Rock Climber, Unix mechanic, IATSE #1 Stagehand

http://wwwkeys.pgp.net:11371/pks/lookup?op=get&search=bergman%40merctech.com

I want a newsgroup with an infinite S/N ratio! Now taking CFV on:
rec.motorcycles.stagehands.pet-bird-owners.pinballers.unix-supporters
15+ So Far--Want to join? Check out: http://www.panix.com/~bergman 



_______________________________________________
Nmh-workers mailing list
Nmh-workers(_at_)nongnu(_dot_)org
http://lists.nongnu.org/mailman/listinfo/nmh-workers
