Re: Duplicate Message Clean-up

I can take the 16,000 individual messages and add them to a separate
(empty) archive.  Doing this weeds out all the extra messages, and only
the 4,200 messages appear in the archive.  The problem with this is that
MHonArc cannot properly handle the messages in this form (HTML).

The messages got duplicated because I never empty the source mailbox
and the archive database got reset.  So, when it went to do its daily
processing, it interpreted all the messages in the mailbox as new
messages.  This ended up creating all of the duplicate message files,
though only 4,200 messages show up via the scan command.

Jay

http://www.shadow-lands.com/orb
http://www.shadow-lands.com/sml

----- Original Message -----
From: "Earl Hood" <ehood(_at_)hydra(_dot_)acs(_dot_)uci(_dot_)edu>
To: <mhonarc(_at_)ncsa(_dot_)uiuc(_dot_)edu>
Sent: Thursday, September 06, 2001 1:38 PM
Subject: Re: Duplicate Message Clean-up

On September 6, 2001 at 07:35, "James Roman" wrote:

I have over 16,000 message files in my current database and only about
4,200 of those messages are valid.  The rest are all duplicates.  I have
read the messages about testing for duplicates and I know what the
problem was.  Now for the hard part.

Does anyone know how I can quickly clean up all of the extra messages?


Since the "dups" have separate message-ids (which must be the case
since MHonArc would have prevented the duplicates from being archived
if message-ids matched), the real task is gathering the list of dups.
The RMM resource can be used to remove messages, but you must come up
with the list.

A possible approach to your problem is to write a Perl script that
takes each message and computes its MD5 checksum (there is a Perl module
that does MD5 checksums), BUT only computing the checksum for the data
between the following comment declarations in each message page:

<!--X-Head-of-Message-->
...
<!--X-Head-of-Message-End-->
and,
<!--X-Body-of-Message-->
...
<!--X-Body-of-Message-End-->

Just maintain a hash where the keys are the MD5 checksums and the
values are the files.  Therefore, when a checksum is computed for a
message page, the hash can be checked to see if there is another file
that has the same checksum.  If so, you have a duplicate.
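
The checksum-and-hash idea above could be sketched roughly as follows.
This is only an illustration, written in Python rather than the Perl
suggested in the message.  The function names, the "msg*.html" file
pattern, and the choice to skip pages that lack the markers are all
assumptions, not anything MHonArc provides.

```python
import hashlib
import re
from collections import defaultdict
from pathlib import Path

# Matches the head and body sections delimited by the comment
# declarations quoted above; everything outside them is ignored.
SECTION_RE = re.compile(
    rb"<!--X-(Head|Body)-of-Message-->(.*?)<!--X-\1-of-Message-End-->",
    re.DOTALL,
)


def page_digest(html):
    """Return the MD5 hex digest of the head and body sections,
    or None if the page contains neither section."""
    sections = [m.group(2) for m in SECTION_RE.finditer(html)]
    if not sections:
        return None
    md5 = hashlib.md5()
    for data in sections:
        md5.update(data)
    return md5.hexdigest()


def find_duplicates(archive_dir, pattern="msg*.html"):
    """Map each checksum seen more than once to the files sharing it.

    The file pattern is an assumption; adjust it to the archive's
    actual message file names.
    """
    seen = defaultdict(list)  # checksum -> list of file names
    for path in sorted(Path(archive_dir).glob(pattern)):
        digest = page_digest(path.read_bytes())
        if digest is not None:
            seen[digest].append(path.name)
    return {d: names for d, names in seen.items() if len(names) > 1}


if __name__ == "__main__":
    import sys

    for names in find_duplicates(sys.argv[1]).values():
        # The first file is treated as the one to keep; the rest are
        # candidates for removal.
        print(names[0], "duplicated by:", " ".join(names[1:]))
```

Run with the archive directory as its only argument, it prints one line
per group of identical messages.  The files after the first in each
group are the candidates for the list of messages to remove.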

Note, the above is only useful if the dups you are talking about are
real dups, i.e. byte-for-byte they are the same with the exception of
the message-ID given to them.  Also, the script logic could get
complicated if the dups are otherwise identical but differ in some
inconsequential message headers.  If so, you may have to play with how
you handle the <!--X-Head-of-Message--> part.
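
One way to "play with" the head part, sketched in Python under some
loud assumptions: that each header is rendered on its own line inside
the head section, and that the header names listed in IGNORED_HEADERS
are the ones that vary.  Both the function name and that list are
hypothetical, so check a few real message pages before relying on
either.

```python
import hashlib
import re

HEAD_RE = re.compile(
    rb"<!--X-Head-of-Message-->(.*?)<!--X-Head-of-Message-End-->", re.DOTALL
)
BODY_RE = re.compile(
    rb"<!--X-Body-of-Message-->(.*?)<!--X-Body-of-Message-End-->", re.DOTALL
)

# Hypothetical list of headers considered inconsequential (lowercase).
IGNORED_HEADERS = (b"message-id", b"received")


def tolerant_digest(html, ignored=IGNORED_HEADERS):
    """MD5 of the body plus the head lines that do not mention an
    ignored header name; returns None if either section is missing."""
    head = HEAD_RE.search(html)
    body = BODY_RE.search(html)
    if head is None or body is None:
        return None
    # Drop any head line that mentions one of the ignored header names.
    kept = [
        line
        for line in head.group(1).splitlines()
        if not any(name in line.lower() for name in ignored)
    ]
    md5 = hashlib.md5()
    md5.update(b"\n".join(kept))
    md5.update(body.group(1))
    return md5.hexdigest()
```

The matching here is a plain substring test, so a Subject line that
happens to contain the word "received" would be dropped as well.  A
stricter match on the rendered header label may be needed.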

--ewh