Is the goal to delete messages with the same subject line (but which may
have different bodies), or messages that are full duplicates (same body,
subject line, and most other headers)? The second case is a lot harder,
since you could have messages whose Received headers differ but which are
otherwise identical. To handle that case, I'd think you'd want to:
1) Use scan or something similar to find messages with the same subject.
2) Use a custom scan template (or resort to grep) to find messages within
the previous set that have duplicated headers (presumably To, From,
Subject, and perhaps a few others).
3) For any duplicates that pass test 2, use mhstore or the like to extract
the bodies, and use md5 or cmp to verify that the bodies are the same too.
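
Step 3 could be sketched roughly as follows. This is only a minimal
sketch, not a tested nmh recipe: the function names msg_body and same_body
are made up for illustration, it assumes mktemp is available, and it
compares the raw body text (everything after the first blank line) with cmp
rather than going through mhstore, so two copies that were MIME-encoded
differently would not match.

```shell
# Print the body of a raw message file: everything after the first blank line.
# (Hypothetical helper; assumes the file starts with a normal header block.)
msg_body() {
    sed '1,/^$/d' "$1"
}

# Succeed (exit 0) if the two message files have identical raw bodies,
# ignoring all headers, including Received.
same_body() {
    sb_a=$(mktemp) sb_b=$(mktemp)
    msg_body "$1" > "$sb_a"
    msg_body "$2" > "$sb_b"
    cmp -s "$sb_a" "$sb_b"
    sb_rc=$?
    rm -f "$sb_a" "$sb_b"
    return $sb_rc
}
```

With nmh, mhpath prints the file name of a message, so a check of messages
1 and 2 might look like: same_body "$(mhpath 1)" "$(mhpath 2)" && echo same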
On Sun, May 3, 2020 at 9:19 PM Ken Hornstein <kenh@pobox.com> wrote:
I know that 'sortm -textfield Subject' will sort messages according to
the subject field. Having run that command, is there a way to then
delete the first duplicate of each message in the list such that if 1
and 2 are duplicates and 6 and 7 are duplicates you would delete messages
2 and 7 leaving 1 and 6?
I want to say you could do something with piping the output of scan
into "uniq -d -f <num>". Might require a custom scan format, but that
seems relatively simple.
Hm, a quick test:
% scan -format '%(msg) %{subject}' | uniq -d -f 1
suggests that it prints the first one, not later ones, so that isn't
exactly what you want. Might be a good starting point, though? You could
probably do something with uniq -c and pipe that to an awk script that
did what you wanted.
--Ken
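
The awk idea above can be sketched without uniq at all. This is a hedged
example rather than a tested recipe: later_dups is a made-up name, and it
assumes the input is the "message number, then subject" output of the scan
format used above. It prints the number of every message whose subject has
already been seen, i.e. the later copies, and it only compares subjects,
not bodies.

```shell
# Read lines of "<msgnum> <subject>" on stdin and print the message number
# of each line whose subject already appeared on an earlier line.
# (later_dups is a hypothetical helper name.)
later_dups() {
    awk '{
        num = $1
        subject = $0
        # Strip the leading message number so only the subject is compared.
        sub(/^[ \t]*[0-9]+[ \t]*/, "", subject)
        # seen[subject] is 0 the first time, non-zero for every later copy.
        if (seen[subject]++) print num
    }'
}
```

Usage would be along the lines of
scan -format '%(msg) %{subject}' | later_dups
and, for the example in the question (1 and 2 duplicates, 6 and 7
duplicates), the output would be 2 and 7. It is probably wise to look over
that list before handing it to rmm.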