
Re: First tests of: Simple recipe to move uninteresting threads in separate mailbox

2006-12-31 10:46:22
At 14:08 2006-12-31 +0100, M. Fioretti wrote:
> The result of the very first run is below. The "extraneous locallock
> file" warning is my fault, because I did not change the lock file name
> according to the cache names I changed, right?

No, see below.

> Apart from that, there is the fact that, when a message has both I-R-T
> and References headers it gives a duplicate entry. This should be
> solved, I think, piping the output of sed to uniq.

That's not strictly necessary.  You'd be adding the overhead of an 
additional call to uniq (which in many cases won't apply - I've certainly 
not seen a plethora of In-Reply-To: *AND* References: in the same 
messages) in trade for reducing the number of lookups when you hit the 
grep operation.  Duplicates won't cause a problem: if they're in the db, 
grep matches on the first one and that's all you need it to do.  OTOH, 
there's no harm in adding uniq to the REFSNL processor; it's just extra 
CPU cycles, and yes, when the duplicate strings are NOT in the cache, 
it'll speed up the grep somewhat.  Your call.  I optimized the id list in 
favour of eliminating bogus matches (like an empty pattern line, which 
would match all lines).
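
To make the trade-off concrete, here is a minimal shell sketch. It is not part of the rcfile; the message-ids and variable names are made up for illustration. It shows the duplicate list being collapsed, and why an empty line in the pattern list is worth filtering out.

```shell
#!/bin/sh
# Hypothetical sample: the same id twice, as happens when In-Reply-To:
# and References: carry the same value.
REFSNL='<id1@example.invalid>
<id1@example.invalid>'

# uniq only collapses ADJACENT duplicates; sort -u also catches
# non-adjacent ones, at the cost of reordering the list.
deduped=$(printf '%s\n' "$REFSNL" | sort -u)
printf '%s\n' "$deduped"

# With -F, each line of the pattern argument is a separate fixed string,
# and an empty string matches every input line.
patterns=$(printf 'aaa\n\nbbb')
empty_hit=no
if printf 'hello\n' | grep -qF "$patterns"; then
    empty_hit=yes
fi
echo "empty pattern matched: $empty_hit"
```

Since grep stops at the first hit with -q, the dedupe step only pays off on the no-match path, which is the point made above.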

> procmail: Assigning "INCLUDERC=/home/marco/.procmail_irrelevant_threads"

I can't speak for others, but often a .procmail/ directory is where 
individual rcfiles get placed - so you have ~/.procmailrc and ~/.procmail/ 
and all your included rcfiles are within the subdir, where they're not 
hidden from directory views and not cluttering your home directory.  It's a 
lot easier to manage your procmail stuff if the bulk of it is segregated 
from non-procmail files.

> procmail: Match on "."

FTR, I didn't explain the condition that this is associated with, but all 
it does is say "if REFSNL isn't empty", since messages without 
references needn't be processed by the rule.

> procmail: Executing "grep,-qF,<45972349.7040708@tacocat.net>
> <45972349.7040708@tacocat.net>,/home/marco/.procmail_ignore.cache"
> grep: /home/marco/.procmail_ignore.cache: No such file or directory
> procmail: Non-zero exitcode (2) from "grep"

Note that this isn't a problem - it's still _no_match_, and thus doesn't 
need special dispensation to check for whether the cache file is there or not.
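
For anyone who wants to see the exit codes for themselves, a small sketch follows. The scratch directory and the ids are hypothetical, and the exact value 2 is what GNU grep reports; other greps are only required to return something greater than 1.

```shell
#!/bin/sh
# Hypothetical scratch directory and cache contents, for illustration only.
dir=$(mktemp -d)
printf '%s\n' '<known@example.invalid>' > "$dir/ignore.cache"

# 0: a line matched
hit=0
grep -qF '<known@example.invalid>' "$dir/ignore.cache" || hit=$?

# 1: the file was readable, but nothing matched
miss=0
grep -qF '<other@example.invalid>' "$dir/ignore.cache" || miss=$?

# greater than 1 (2 with GNU grep): the file could not be opened
missing=0
grep -qF '<known@example.invalid>' "$dir/no-such.cache" 2>/dev/null || missing=$?

echo "hit=$hit miss=$miss missing=$missing"
rm -rf "$dir"
```

A procmail "?" condition only asks whether the exit code was zero, so the "miss" and "missing" cases are treated alike.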

> procmail: Extraneous locallockfile ignored

That's my fault, and I should know better.  I want the lockfile to ensure 
that my READ access to the files (during grep) isn't affected by an update 
(from mutt, or another concurrent procmail invocation), but a local 
lockfile won't do that here: it only takes effect when the recipe has a 
delivery-type action, and at this level there isn't one, only a braced 
action.  Some reorg of the recipe (which results in a streamlined action 
anyway) solves this; retaining the bracing would require use of the 
LOCKFILE pseudo-variable, which is unwieldy.  I've got a rewrite at the 
bottom of this message.


Although I expect you might be coming around to the idea of just using 
formail to manage the id cache from your MUA as well, in case you don't, 
note that there's a further rcfile simplification: condense the two grep 
operations to one line and eliminate the scoring:

* ? grep -qF "$REFSNL" ignore*.cache

grep will only end up searching those cache files which exist 
(ignore.cache, ignore.mua.cache - the latter of which is renamed here to 
make it easier to match with a single focused wildcard name).  So, besides 
simplifying the rcfile makeup, in the event that the ignore.cache doesn't 
exist (which, well, should really only happen when you haven't yet tagged 
anything), you don't end up with TWO invocations of grep - it only runs 
once either way.

Note that providing two separate filenames on the commandline will cause 
grep to complain about whichever one doesn't actually exist.  The wildcard 
gets around that, because grep is only seeing the filenames which do exist 
(provided at least one of them matches the wildcard).
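
A quick way to convince yourself of the wildcard behaviour, sketched in plain sh with a hypothetical scratch directory and id. It also shows the edge case where nothing matches the wildcard at all: the shell then hands the literal pattern to grep, which fails to open it, and that still counts as "no match".

```shell
#!/bin/sh
# Hypothetical scratch directory; only the MUA cache exists at first.
dir=$(mktemp -d)
printf '%s\n' '<tagged@example.invalid>' > "$dir/ignore.mua.cache"

# The glob expands to the one file that exists, so grep never sees the
# name of the missing ignore.cache.
one=0
(cd "$dir" && grep -qF '<tagged@example.invalid>' ignore*.cache) || one=$?

# With no cache files at all, the unexpanded pattern is passed through
# and grep reports an error (non-zero exit, so still no match).
rm -f "$dir/ignore.mua.cache"
none=0
(cd "$dir" && grep -qF '<tagged@example.invalid>' ignore*.cache) 2>/dev/null || none=$?

echo "one existing file: $one, no files: $none"
rm -rf "$dir"
```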

So, you have the log output from one run, apparently for a message that 
wasn't matching against something in your MUA hitlist.  Seems like you'd 
want to see it in action.


Here's a further revision:

# simple recipe to ignore threads based on prior cache of threads to ignore.
# 20061230, SBS

# get In-Reply-To messageid, check to see if it is in the ignore cache or
# in the mua_ignore cache.  formail stores cache with ascii-z terminations,
# but grep will still match the binary file.
# if we have a match in the MUA id file or current cache, ADD the messageid
# of THIS message to the cache, so that replies to it will also be ignored.

# ensure these are blank, not set to something you might have used them for
# previously
REFS=
REFSNL=

:0
* In-Reply-To:.*\/[^    ].*
{
         # Assign the results to REFS
         REFS=${MATCH}
}

:0
* ^References:.*\/[^    ].*
{
         # Append the results to REFS
         # no consideration as to whether REFS was null or not.
         REFS="${REFS} ${MATCH}"
}

# by doing this ONLY if REFS contains non-whitespace, we spare
# ourselves the overhead of the pipe chain invocation when it isn't
# needed (i.e. messages with no references).  Arguably, REFS shouldn't
# be set at all if the headers are empty, but this check is cheap to perform
:0
* REFS ?? [^    ]
{
         REFSNL=`echo "$REFS" | tr -s "  " "\n\n" | \
                 sed -e '/^\([^<].*\|.*[^>]\|\)$/ d'`
}

:0hc:ignore.cache$LOCKEXT
* REFSNL ?? .
* ? grep -qF "$REFSNL" ignore*.cache
| formail -D 40000 ignore.cache

# if the preceding conditions matched, then file this message
# away as irrelevant.
:0A:
irrelevant.threads



That condition for invocation of REFSNL shows as follows in the verbose log:

procmail: No match on "In-Reply-To:.*\/[^       ].*"
procmail: No match on "^References:.*\/[^       ].*"
procmail: No match on "[^       ]"
procmail: No match on "."

Basically, no references, no real work done.  Since there's a fair 
number of originating (i.e. not followup) messages, this is a good thing.
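
If you'd like to try the REFS-to-REFSNL step and the cache lookup outside of procmail, the following is a rough stand-alone sketch.  Assumptions: the header content and ids are invented, the sed expression is a portable rewording of the filter in the recipe (keep only lines of the form <...>) rather than the exact one, and the cache file is a hand-made stand-in for a formail -D cache, built here as NUL-terminated ids.

```shell
#!/bin/sh
# Hypothetical header content: two message-ids plus a stray token that
# the filter should discard.
REFS='<a1@example.invalid>  <b2@example.invalid> (comment)'

# Split on runs of spaces/tabs, then delete every line that is not <...>.
REFSNL=$(printf '%s\n' "$REFS" | tr -s ' \t' '\n\n' | sed -e '/^<.*>$/!d')
printf '%s\n' "$REFSNL"

# Stand-in for the id cache: entries terminated by NUL bytes rather
# than newlines.
dir=$(mktemp -d)
printf '<zz@example.invalid>\000<b2@example.invalid>\000' > "$dir/ignore.cache"

# grep -F treats each line of REFSNL as a fixed string; -q suppresses the
# "binary file matches" notice and just sets the exit code.
matched=no
if grep -qF "$REFSNL" "$dir/ignore.cache"; then
    matched=yes
fi
echo "matched: $matched"
rm -rf "$dir"
```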


A few things to ponder:

1. If prior to this recipe, you had list identification rules, which set a 
variable for the listname but didn't actually deliver, you could employ 
that in determining the filename for the irrelevant thread file - i.e. 
having list-specific files.

2. Since cc'd messages will bear the same headers as the listbound copy, if 
you're cc'd on a thread which you're ignoring, you'll be ditching it 
here.  You may want to add some logic prior to this ruleset which takes 
direct cleartext addressed correspondence and delivers it accordingly.

3. If at a later date, you reprocess your mailbox or irrelevant.threads 
files, actions will be different, as the cache will be in a different state 
(the same holds true if you were using a cron cycled killfile).  If the 
same cache is in effect, then things will still be filtered - but as ids 
disappear from the cache, so too will their effect on other replies.  Just 
something to keep in mind.

4. Based on the messageid cache mechanism for ignoring, there's no reason 
that someone using a non-shell mailer can't set up a forward rule and 
forward key messages to ignore to themselves with a special key to trigger 
a recipe to grab the messageid from the header and add it to the 
cache.  Much like so:

# grab messageid from body (i.e. forwarded with headers context)
:0b:ignore.cache$LOCKEXT
* From: expression-matching-thyself
* Subject: ignore THIS keyword
| formail -D 40000 ignore.cache

This should present the basic mechanism for accomplishing the task, though 
you should add whatever you see fit to ensure this isn't something that can 
be arbitrarily manipulated by some passerby on the net.
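
To illustrate what the forwarded body needs to contain for this to work, here is a small sketch that pulls the id out by hand.  It is only an illustration with a made-up message body; it uses sed instead of formail and does not touch the cache.

```shell
#!/bin/sh
# Hypothetical forwarded body: the original headers show up as body text.
body='Please ignore this thread.

Message-ID: <troll42@example.invalid>
Subject: Re: endless flamewar

original text'

# Take the first Message-ID: line and keep only the <...> part.
msgid=$(printf '%s\n' "$body" |
    sed -n 's/^Message-[Ii][Dd]:[[:space:]]*\(<[^>]*>\).*/\1/p' |
    sed 1q)
echo "$msgid"
```

If the MUA forwards without the original headers, there is no id in the body to grab, so it's worth checking how your mailer forwards before relying on the trigger.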


Now, after all that, let me wish you a new year with fewer trolls and 
nonsensical threads needing to be filtered out <g>

So, where's my beer?  I accept drop shipments from the UK, home of porters 
and cream stouts.  <g>

---
  Sean B. Straw / Professional Software Engineering

  Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
  Please DO NOT carbon me on list replies.  I'll get my copy from the list.


____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
