procmail

Re: Regexp fails in scoring recipe

2003-05-06 22:04:17


Dallman Ross wrote:

Kevin Wu [mailto:tessar(_at_)bigfoot(_dot_)com] wrote:

Dallman Ross wrote:

Okay, that's not bad.  I'll note that a dot followed by a plus suffices
in place of your "not whitespace" syntax, however.  That's in fact what
I meant with the plus signs in my line:

* $ B ?? ^\/\[ ()[0-9]:.*$(.+$)*(.*\<)?$LOCATIONS\>.*$(.+$)*.*
............................^...........................^

"One or more chars on this line" is essentially what's in the parens
there, in pseudo-code language.

.+ matches whitespace, though. I wanted some optional lines that are "visibly non-empty" if they are present; i.e. each such line must contain at least one visible character and not just whitespace.
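To make the distinction concrete, here is a minimal sketch using Python's `re` module. It is only an analogy: procmail has its own regex engine, and the names `ANY_CHARS` and `VISIBLE` are made up for illustration.

```python
import re

# Hypothetical pattern names, for illustration only.
ANY_CHARS = re.compile(r"^.+$")          # one or more of any character
VISIBLE = re.compile(r"^.*[^ \t].*$")    # at least one non-blank character

blank_line = "   \t "        # spaces and a tab, nothing visible
text_line = "  Menlo Park  "

print(bool(ANY_CHARS.match(blank_line)))  # True: .+ accepts whitespace
print(bool(VISIBLE.match(blank_line)))    # False: no visible character
print(bool(ANY_CHARS.match(text_line)))   # True
print(bool(VISIBLE.match(text_line)))     # True
```

So a dot followed by a plus is enough for "this line is not empty", but not for "this line shows something".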

We can slightly optimize the scoring.  We don't need to 1^0 push at
the top if we reverse things.  We can count each instance of $LOCATIONS
once, and subtract each instance of ROAD WORK followed by $LOCATIONS.
I'm afraid the mailman package is going to wrap my lines again where
I don't want them wrapped, so I will break up the second condition with
a continuation backslash. This is similar to my line above, but with weighting added and with the `\/' match token removed:

:0:
* $  1^1  B ?? ^\[ ()[0-9]:.*$(.+$)*(.*\<)?$LOCATIONS\>
* $ -1^1  B ?? ^\[ ()[0-9]:.* ROAD +WORK$(.+$)*(.*\<)?\
               $LOCATIONS\>
KEEP_ME

:0 E  # else
{
  HOST = byebye  # slightly more efficient than saving to /dev/null
}


That works fine.  I just tested it under 3.22 and NetBSD 1.6.  It has
an advantage over what you show, because sometimes the locations appear
more than once in an entry.  You'd get bogus negative scores then.
My scoring anchors each instance of $LOCATIONS from the start of the entry to avoid that.
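For anyone following the arithmetic, here is a rough Python sketch of how I understand the `w^x` weighting described in the procmailsc man page: the first match adds w, the second w*x, the third w*x*x, and the recipe fires if the total ends up positive. The function names and the match counts below are hypothetical; how many times procmail really matches each condition is exactly what is in question in this thread.

```python
def condition_score(weight, exponent, matches):
    """Score added by one `weight^exponent` condition that matched `matches` times."""
    # k-th match (counting from 0) contributes weight * exponent**k.
    # With exponent 0 only the first match counts, since 0**0 == 1 in Python.
    return sum(weight * exponent ** k for k in range(matches))


def recipe_score(conditions):
    """Total score for a list of (weight, exponent, matches) tuples."""
    return sum(condition_score(w, x, n) for w, x, n in conditions)


# Assumed counts: 3 location hits, all 3 of them road work.
all_road_work = recipe_score([(1, 1, 3), (-1, 1, 3)])
# Assumed counts: 3 location hits, only 2 of them road work.
one_real_event = recipe_score([(1, 1, 3), (-1, 1, 2)])

print(all_road_work)   # 0, not positive, so the message is not kept
print(one_real_event)  # 1, positive, so the message is kept
```

With `1^1` and `-1^1` every match counts the same, so the total is simply "locations seen" minus "road work at those locations".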

I'd be very curious to see if that scoring fails in any way on your
system.  I have a new theory about what's going wrong on your system,
anyway.

Okay, but from your ideas below, a non-scoring recipe does sound more appealing given the weirdness I've been seeing.


For reports like the one in the original posting, with two road-work events in Menlo Park, one road-work event on Dumbarton, and no non-road-work events in locations of interest, this filter works with procmail 3.15.2. However, it doesn't work on procmail 3.22 because, in the last condition, two occurrences of Dumbarton are counted.

There is a known bug in procmail concerning putting B on the initial
recipe line, where it then can't be turned off in subsequent recipes.
Or is it putting H on the initial recipe lines?  Or both?  I don't
remember, for two reasons: one, I hardly ever use body greps (my
extensive spam traps are almost all header-only checks); and two,
I always use the alternative syntax I show above, specifically
because of this bug.  So I don't run into it personally.  However,
could it be that your rc is invoking H or B flags on recipes and
then running into this bug?  Instead of

        :0 B
        * body condition

Try

        :0
        * B ?? body condition

I didn't know about that bug. I'll have to revisit all my recipes.

This approach cheats in that it attempts to list all the complementary events to road work (i.e. these are the events I want to see as opposed to the ones I don't want to see). What I don't like about this recipe is that some new classification could appear in the traffic reports (e.g. "disaster" or "flood"),

All right, here is a way around that.  We define "not road work"
and use it.  Here it is.  If you plug it into your recipe, it
should work just fine.

 SPACE  = " "
 TAB    = "	"  # must be a single literal tab character
 WS     = "$SPACE$TAB"

 NOT_RW =          "[^R]|R[^o]|Ro[^a]|Roa[^d]|Road[^$WS]"
 NOT_RW =  "$NOT_RW|Road[$WS][^W]|Road[$WS]W[^o]"
 NOT_RW = "($NOT_RW|Road[$WS]Wo[^r]|Road[$WS]Wor[^k])"

I use something similar for "$NOT_RCVD", which I use to check for
split Received: headers in mail.  (It's one of the spammers' tricks.)

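I think that idea needs a tweak, assuming some word anchors around NOT_RW: each alternative should be padded with dots so that every alternative is as long as the word being excluded.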

In 2D: NOT_AB = "[^a].|a[^b]"

In 3D: NOT_ABC = "[^a]..|a[^b].|ab[^c]"

In 4D: NOT_ABCD = "[^a]...|a[^b]..|ab[^c].|abc[^d]"

and so on. I'll use this idea in a non-scoring recipe.
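The pattern generalizes mechanically. As a quick sanity check, here is a small Python sketch that builds the alternation for any word; `not_word` is a made-up helper name, and it uses Python's `re` rather than procmail's regex engine, so treat it as an illustration of the idea only.

```python
import re


def not_word(word):
    """Build an alternation matching any len(word)-character string except word."""
    n = len(word)
    alternatives = []
    for i, ch in enumerate(word):
        prefix = re.escape(word[:i])     # characters that still agree with word
        mismatch = "[^" + re.escape(ch) + "]"  # first position that differs
        padding = "." * (n - i - 1)      # pad to the full word length
        alternatives.append(prefix + mismatch + padding)
    return "|".join(alternatives)


print(not_word("ab"))    # [^a].|a[^b]
print(not_word("abc"))   # [^a]..|a[^b].|ab[^c]
print(not_word("abcd"))  # [^a]...|a[^b]..|ab[^c].|abc[^d]
```

Each alternative fixes the position of the first mismatch, and the dot padding keeps all alternatives the same width, which is what makes word anchors around the group behave.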

Thanks,
Kevin


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail