procmail

Re: Simple recipe to move uninteresting threads in separate mailbox

2006-12-30 15:52:00
At 18:02 2006-12-30 +0100, M. Fioretti wrote:

<455442DB.3030502@unix.sbg.ac.at>
<20061115131249.GA4511@masterpost>
<20061203175823.GL3261@aragorn.home.lxtec.de>

etc....

that file will be emptied every two weeks by a cron job.

Er, that seems unfortunate - any messages you'd deprecated the day before 
the cron job runs won't have any effect on future messages, because they'll 
be purged on an arbitrary 2 week cycle.

If the new entries are appended to the file, you might try using 'tail' 
instead, so you're discarding the topmost (oldest) messages beyond some 
keeper threshold, and thus the file doesn't grow to insane 
proportions.  Your script could even use 'wc' to determine if the file is 
beyond a threshold necessitating such trimming in the first place.
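
Something along these lines would do for the wc/tail approach.  It's only 
a sketch: the function name trim_idfile and the threshold are made up for 
illustration, and it assumes one Message-Id per line with new entries 
appended at the bottom.

```shell
# trim_idfile FILE KEEP
# Keep only the newest KEEP lines of FILE (hypothetical helper name).
# Assumes new entries are appended, so the oldest lines are at the top.
trim_idfile() {
    file=$1
    keep=$2
    [ -f "$file" ] || return 0
    # $(( )) strips the padding some wc implementations emit.
    count=$(( $(wc -l < "$file") ))
    # Only rewrite the file once it has grown past the threshold.
    if [ "$count" -gt "$keep" ]; then
        tmp="$file.$$"
        tail -n "$keep" "$file" > "$tmp" && mv "$tmp" "$file"
    fi
}
```

You'd call it as, say, trim_idfile "$HOME/.irrelevant_threads" 500 (the 
location and the 500 are assumptions).  It does no locking of its own.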

If you manipulate the data file, you should probably use locking strategies 
to allow for your MUA to append, your cron to shorten, and your procmail to 
read and append.  Logically, you can eliminate the cron entirely: as the 
file is intended to affect how procmail handles new messages, you can have 
procmail decide when to resize it.
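
For the locking, a minimal sketch using the lockfile(1) utility that ships 
with procmail might look like this.  The wrapper name with_lock is 
hypothetical, and it only helps if every writer (MUA hook, cron script, 
procmail recipe) agrees on the same lockfile path.

```shell
# with_lock LOCKFILE COMMAND [ARGS...]
# Run COMMAND while holding LOCKFILE (hypothetical helper name).
# Uses lockfile(1) from the procmail package, so the lock is the same kind
# procmail creates for its own local lockfiles.
with_lock() {
    lock=$1
    shift
    # Give up after 5 retries instead of waiting forever.
    lockfile -r 5 "$lock" || return 1
    "$@"
    status=$?
    # lockfile creates read-only files, hence rm -f.
    rm -f "$lock"
    return "$status"
}
```

A cron or MUA script would then wrap whatever touches the ID file in a 
call to with_lock, passing the lockfile path first and the command after it.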

I am trying to write a recipe that does this:

if (new message has In-Reply-To header with a Message-Id contained in
    .irrelevant_threads)

    (with "grep -qFf" or similar systems which don't assume the
     existence of databases, perl modules and so on)

        append the Message-Id of the new message to .irrelevant_threads
        save the new message only in $MAILDIR/irrelevant_list_threads

how would you write this recipe? Frankly, I've never tried something
so complex (for me of course) so I'd really appreciate your help here!

formail messageid cache comes to mind, but you'd need to diddle with header 
names.  I know I've written some recipes like this in the past (in response 
to queries on this list), but they're not in my archive of test recipes, so 
apparently I didn't tinker with them on my host.

Any possibility that the method you're using to have your MUA add the ID to 
the file could be modified to use formail?  If you did this, then formail 
could manage your cache filesize automatically for you AND the replies to 
those replies would also be ignored (which are threads you're not 
interested in, right?).  Otherwise, you need to be using References: as 
well, not just In-Reply-To:.

While I don't have a ready-made solution kicking around, I do have a recipe 
which I wrote for someone to deal with redelivery of a fscked up mailbox 
(some wannabe sysadm lunched mail delivery for their users and needed to 
recover as much as they could).  That involved grabbing the local SMTP ID 
from the local mailhost and using that in place of the messageid which 
formail cached.  Here, we can use it to grab In-Reply-To tokens and do the 
same thing.  Basically, we take that token and pass it along in a COPY of 
the headers of the message to formail to rewrite it and pass it along into 
formail to cache it.

# get In-Reply-To messageid, and if that header isn't found, use References:
# then check to see if it is in the ignore cache or in the mua_ignore cache.
# formail stores cache with NUL terminations, and, best as I assume, your
# MUA is using NL, which is why the two grep invocations differ.  Maximal
# matching is used so that if the first lookup succeeds, the second check is
# skipped.  If you set your MUA to invoke formail to cache ignored threads,
# then you can use ONE file and can do away with the separate checks, which
# will be a LOT more efficient.
# if we have a match in the MUA id file or current cache, ADD the messageid
# of THIS message to the cache

:0
* 9876543210^0 ^In-Reply-To: \/<[^>]*>
* 9876543210^0 ^References: \/<[^>]*>
{
         # Do nothing - we just set $MATCH one of two ways above
}

:0 A: ignore.cache.lock
* 9876543210^0 ? grep -qzF -e "$MATCH" ignore.cache
* 9876543210^0 ? grep -qF -e "$MATCH" path_to_mua_idfile
{
         # lockfile above already (which locked for the greps as well)
         :0Whc
         | formail -D 40000 ignore.cache

         # if the preceding conditions matched, then file this message
         # away as irrelevant.
         :0
         $MAILDIR/irrelevant_list_threads
}
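
To see the two lookups outside of procmail, here is a small sketch.  The 
function names are made up, the NUL-terminated layout of the formail cache 
and the one-ID-per-line layout of the MUA file are the assumptions stated 
in the recipe comments, and -z is a GNU grep option.

```shell
# id_in_cache ID CACHE
# Succeed if ID occurs in a NUL-terminated cache (assumed formail -D layout).
# -z: records end in NUL (GNU grep); -F: treat ID as literal text, since
# Message-Ids are full of regex metacharacters; -q: exit status only.
id_in_cache() {
    grep -qzF -e "$1" "$2"
}

# id_in_list ID FILE
# Same lookup for a plain file with one ID per line (assumed MUA layout).
id_in_list() {
    grep -qF -e "$1" "$2"
}
```

Both return 0 on a hit and non-zero otherwise, which is all a procmail 
"?" condition looks at.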

Note that we're not using the formail operation to _check_ the id database, 
just to update it.  Normally, if you use formail to check, it's still 
_adding_ the current ID, and you'd use the return value to determine 
whether the id was in there already.  That's fine for dealing with 
duplicate checking, not so useful for what you're trying to do.

I haven't subjected the above to extensive testing.  One obvious issue will 
be with MUAs which do in-reply-to MULTIPLE messages (treating the header 
more like References:).  The extraction of the id in In-Reply-To: (and 
References:) is formed specifically to grab just the FIRST ID - this isn't 
ideal (some iterative code would be necessary to get all of them), but it's 
better than expecting to match multiple IDs on a single line in a single 
lookup (which won't happen).
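
If you did want all of the IDs rather than just the first, the extraction 
could be pushed out to a small filter.  This is an untested-in-production 
sketch: thread_ids is a hypothetical name, grep -o is not strictly POSIX, 
and any <...> token in those two headers is taken to be a message ID.

```shell
# thread_ids
# Read a message (or just its header) on stdin and print every <...> token
# found in In-Reply-To: and References:, one per line, duplicates removed.
thread_ids() {
    # Stop at the blank line ending the header, join folded continuation
    # lines onto their header, and keep only the two headers of interest.
    awk '
        /^$/ { exit }
        /^[ \t]/ { if (keep) printf " %s", $0; next }
        { keep = (tolower($0) ~ /^(in-reply-to|references):/)
          if (keep) printf "\n%s", $0 }
        END { print "" }
    ' | grep -o '<[^<>]*>' | awk '!seen[$0]++'
}
```

Each ID it prints can then be looked up on its own, instead of hoping to 
match several IDs in one go.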

In fact, References: is another header you should be checking, and which I 
added code above to handle.  In my quick test of the above recipe with just 
In-Reply-To:, I noted that two messages in a recent procmail list 
discussion were not pulled aside - neither had an In-Reply-To, but instead 
had References:.  Updating the recipe to what you see above resulted in all 
the messages in that (particular) discussion being identified.

Ultimately, it'd be a LOT easier to simply write a C/C++ program to take 
the In-Reply-To: and References:, combine them removing any dupes, then 
scan a cache file, and if found, insert the Message-Id to that cache file 
(after perhaps making sure it's not already there), returning a true/false 
as to whether the message is related to an ignored thread.  Whenever some 
bozo writes a message out of the blue as a reply (rather than composing it 
as a reply), or where people use crappy software, you can expect to have 
messages that don't get caught no matter what you do.  That's life.
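
The scan-and-insert half of that logic, sketched in shell rather than 
C/C++ (so this is not the program described above, just the same idea). 
The name check_ignored is hypothetical, the cache is assumed to hold one 
ID per line, "grep -f -" is assumed to read patterns from stdin as GNU 
grep does, and the caller is assumed to hold whatever lock protects the 
file.

```shell
# check_ignored CACHE MSGID
# Read candidate parent IDs on stdin, one per line.  If any of them is
# already listed in CACHE, append MSGID (unless it is there already) and
# return 0.  Otherwise leave CACHE alone and return 1.
check_ignored() {
    cache=$1
    msgid=$2
    [ -f "$cache" ] || return 1
    # -F: literal strings, -x: whole-line match, -f -: patterns from stdin.
    grep -qxF -f - "$cache" || return 1
    grep -qxF -e "$msgid" "$cache" || printf '%s\n' "$msgid" >> "$cache"
    return 0
}
```

The return value is the true/false you'd test from a procmail condition; 
the parent IDs on stdin would be the combined, de-duplicated In-Reply-To: 
and References: tokens.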

Oh, did I mention that, except for your MUA-managed file, the above setup 
automatically keeps the cache file limited to a reasonable size?  This is 
yet another reason to consider tweaking your MUA to emit to 
the cache file via formail (or forwarding the message to a trivial procmail 
recipe that invokes formail).  Chances are, even if the overhead of adding 
an initial message to the list from the MUA was a lot more CPU intensive 
(and this really isn't that big of a hit), you'll still improve mail 
processing runtimes (only one grep needed) and eliminate the cron (along 
with the arbitrary cutoffs which it introduces).

0) I am aware that this will _also_ hide "new" threads made replying
   to the last received message and changing the subject, and that's
   fine with me

... probably because only morons piggyback new threads onto old ones, and 
who wants to read the ramblings of a moron... <g>

1) I _have_ already found and read
   http://www.it.ca/software/procmail-filter-msgid
   and the corresponding thread in the procmail list archives, but I'll
   confess I'm confused. Does it _really_ have to be so complicated?

Procmail doesn't have internal database features.  There isn't an external 
program all packaged up to do what YOU are trying to do (heck, even your 
MUA doesn't do it...)

 The
   recipe "flow diagram" above is just one check and two consecutive

The first condition is one to match the specific list(s) the guy's filter 
is supposed to operate on.  He obviously did it as a separate rule so as to 
keep the logic of the second rule (the "meat") simpler, otherwise adding 
additional lists or posters could break the second rule.  Technically, he 
should have used maximal matching (score like 9876543210^0 instead of 1^0) 
for the list id, which would allow him to have a lot of lists and it'd stop 
checking the instant it matched to one.

Then, provided that the prior condition matched, he grabs the messageid, 
and then truncates that and provided the From: addresses in those posts 
match addresses of the guy he's killfiling, he echoes the messageid token 
to a cache file.

The second indented rule checks for a filtercache file, and then greps the 
ENTIRE HEADER against the lines in the filtercache file.  Ugh.

the last level of indentation adds the current messageid into the 
filtercache if and only if it isn't already in there.

he's got another bug in the script (probably from debugging it):  the 
verbose=on is commented out, but the verbose=off at the bottom is always 
on.  So under certain conditions (whenever the list matches), logging 
verbosity will get turned off after this recipe runs.

A better way (short of implementing a PUSH/POP mechanism) would be to 
preserve the existing verbosity level, then restore it:
         VERBSAVE=$VERBOSE
         VERBOSE=ON
         ...
         VERBOSE=$VERBSAVE
         VERBSAVE=


   actions if the check succeeds. Maybe I'm naive, but I was expecting
   the recipe to be more or less the same length (3/4 lines). What am
   I missing?

I don't understand.  You think the whole thing should be accomplished 
in 3 or 4 lines?


I'd say you owe me some beer, but as of yet, nobody has made good on that 
tab...

---
  Sean B. Straw / Professional Software Engineering

  Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
  Please DO NOT carbon me on list replies.  I'll get my copy from the list.


____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail