On September 6, 2001 at 07:35, "James Roman" wrote:

> I have over 16,000 message files in my current database and only about
> 4,200 of those messages are valid. The rest are all duplicates. I have
> read the messages about testing for duplicates and I know what the
> problem was. Now for the hard part.
> Does anyone know how I can quickly clean up all of the extra messages?
Since the "dups" have separate message-ids (which must be the case
since MHonArc would have prevented the duplicates from being archived
if message-ids matched), the real task is gathering the list of dups.
The RMM resource can be used to remove messages, but you must come up
with the list.
A possible approach to your problem is to write a Perl script that
takes each message page and computes its MD5 checksum (the Digest::MD5
Perl module does this), BUT only computing the checksum for the data
between the following comment declarations in each message page:

<!--X-Head-of-Message-->
<!--X-Body-of-Message-End-->
Just maintain a hash where the keys are the MD5 checksums and the
values are the filenames. Then, when a checksum is computed for a
message page, the hash can be checked to see if another file already
has the same checksum. If so, you have a duplicate.
Note, the above is only useful if the dups you are talking about are
real dups, i.e. byte-for-byte they are the same with the exception of
the message-ID assigned to them. The script logic gets more
complicated if messages are dups but some other inconsequential
message headers vary between them. If so, you may have to play with
how you handle the <!--X-Head-of-Message--> part (e.g. skip it, or
strip the varying headers before checksumming).
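Putting the pieces together, here is a minimal sketch of such a script. It assumes the standard MHonArc layout comments and uses Digest::MD5; to sidestep the varying-headers caveat it checksums only the body region, so widen the regex to start at <!--X-Head-of-Message--> if you want headers compared too:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# find_dups(@files): returns [dup_file, first_file] pairs, where each
# dup_file has the same body checksum as an earlier first_file.
sub find_dups {
    my @files = @_;
    my (%seen, @dups);
    for my $file (@files) {
        open my $fh, '<', $file or do { warn "skipping $file: $!\n"; next };
        my $page = do { local $/; <$fh> };   # slurp the whole page
        close $fh;
        # Checksum only the body between MHonArc's layout comments.
        # Widen to <!--X-Head-of-Message--> if headers should count too.
        next unless $page =~
            m/<!--X-Body-of-Message-->(.*?)<!--X-Body-of-Message-End-->/s;
        my $sum = md5_hex($1);
        if (exists $seen{$sum}) {
            push @dups, [ $file, $seen{$sum} ];   # duplicate found
        }
        else {
            $seen{$sum} = $file;
        }
    }
    return @dups;
}

# Report duplicates; this list is what you would feed to RMM.
for my $pair ( find_dups(@ARGV) ) {
    print "$pair->[0] duplicates $pair->[1]\n";
}
```

The hash lookup makes each page a single checksum computation plus an O(1) check, so even 16,000 files should go quickly.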