procmail

RE: Regexp fails in scoring recipe

2003-05-05 16:21:41
Kevin Wu [mailto:tessar(_at_)bigfoot(_dot_)com] wrote on Thursday, May 01, 2003:


Dallman Ross wrote:

Okay.  Meanwhile, I looked harder at the original post of yours, at 

http://www.xray.mpe.mpg.de/mailing-lists/procmail/2003-04/msg00372.html

Thank you. I know it's hard to find the source of the problem 
from only the information in the original post. One possibility is 
that my server had an update of Solaris applied, and procmail doesn't 
cope with it well in all cases. I have an interest in determining
whether this is a bug in Solaris.

Meanwhile, if you send me a sample copy, I'll play with it.

Well, I took a few days off before tackling this, but I have a
working recipe for you.  I still do not know why your scores are 
different in your test environment (where your recipe works, as you 
said) from what they are in your working environment (where the 
scores don't come out right).  I would be interested in finding out 
why, however.  (I was not surprised to learn bogofilter had nothing 
to do with it.)  I am inclined to believe that some aspect of your
server environment acts differently when you are logged in from
how it acts when you are not.  Maybe in one case procmail runs
under your uid and shell, but otherwise runs under suid root and
root's (postulated to be different) shell?  Do you have a shell
definition line in your .procmailrc?  I recommend "SHELL = /bin/sh".
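
For comparing the two environments, a few lines like these at the top
of .procmailrc would make procmail log what it does in each case.
This is only a sketch: the log file path is a placeholder, and only
the SHELL line comes from the recommendation above.

#-------------------------------------------------------------
 # Use the same shell no matter how procmail was invoked.
 SHELL       = /bin/sh
 # Placeholder path; any file you can write to will do.
 LOGFILE     = $HOME/procmail.log
 # Extended diagnostics: shows each condition and its result.
 VERBOSE     = yes
 LOGABSTRACT = all
#-------------------------------------------------------------

Running one message through while logged in and one through normal
delivery, then comparing the two log excerpts, should show where the
results start to differ.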

In any case, now that I have seen some sample traffic reports and 
received two directly myself upon having subscribed, I found that a 
recursive SWITCHRC can work for this and give you abbreviated reports 
that show you just what you want and not the cruft you don't want.

First, though, some precursor stuff.  You had in your recipe,

  LOCATIONS="(dumbarton|(east )?palo alto|stanford|menlo park|\\
            redwood city|mountain view)"

As I mentioned in a previous reply, you could have a problem with
the word breaks.  I see in the actual reports that once in a while
the spacing between words is inconsistent, which only corroborates
my earlier concern.  I believe it would be easy to have, e.g.,
"PALO  ALTO" (with two spaces between) show up instead of what
you were expecting, and you'd miss it.  Also, a line end could
happen in the middle of the phrase.  I recommended using only
one word (and had said that you don't, in any case, need the EAST
for EAST PALO ALTO, since you are accepting the second two words
anyway).  If you don't want potential false hits with "REDWOOD"
or "MOUNTAIN", however, then here's another way.  A tab and a 
space are found inside of each of the two pairs of square brackets:

WRDBRK    = ($[         ]*|[    ]+)
X         = $WRDBRK
LOCATIONS = "(Dumbarton|(East${X})?Palo${X}Alto|Stanford|Menlo${X}Park|\\
             Redwood${X}City|Mountain${X}View)"
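
To check the pattern by itself before wiring it into the real recipe,
a throwaway recipe along these lines could go right after the
assignments.  It is a sketch under assumptions: the log text is
invented, and it presumes a LOGFILE is set so that the LOG line lands
somewhere.  Matching is case-insensitive unless the D flag is given,
so "Palo" also catches "PALO".

#-------------------------------------------------------------
 # Search the body for any of the locations; \/ puts the
 # matched text into MATCH.
 :0
 * $ B ?? \<\/$LOCATIONS
 { LOG = "location hit: $MATCH
" }
#-------------------------------------------------------------

The closing quote sits at the start of its own line on purpose: it
ends the log entry with a newline.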

All right, I used the above in my test harness, and it worked fine.
Here is the main recipe I put below that (goes in .procmailrc):

#-------------------------------------------------------------
 :0
 * ^From: KPIX\.Traffic\.Router@kpix\.com
 * ^Precedence: bulk
 * $ B ?? ^\/\[ ()[0-9]:.*$(.+$)*(.*\<)?$LOCATIONS\>.*$(.+$)*.*
 { SWITCHRC = traffic }
#-------------------------------------------------------------

(I added the "Precedence:" check because you are /dev/nulling the
reports that don't have a city of interest in them, and I imagine that
the list administrator might write you some time with an announcement
that you'd otherwise miss.  In my confirmation mail from the list
for signing up, for example, there was no Precedence: header.)

Now I made a separate rc-file called "traffic".  That gets
run recursively.  It's important to have a breaking occurrence
in a recursive rc; otherwise, it will iterate until your server
goes kablooey, or something.  :-)  I tested this one on two of
the actual 8-a.m. traffic reports from KPIX:

#-------------------------------------------------------------
 :0 Dich:
 * ! MATCH ?? ^^(.* )?ROAD +WORK$
 | echo "$MATCH" >> somefile

 :0
 * $ B ?? ^$\MATCH(.*$)*\/\[ ()[0-9]:.*$(.+$)*(.*\<)?$LOCATIONS\>.*$(.+$)*.*
 { SWITCHRC = $_ }
#-------------------------------------------------------------

(Heh.  Note that there's no scoring.  Not that I have anything against
scoring, but . . . I didn't need it.)  That long condition might wrap
before it gets to the list, so I'll put a version here with a line
break:

 * $ B ?? ^$\MATCH(.*$)*\/\[ ()[0-9]:.*$(.+$)*(.*\<)?\
           $LOCATIONS\>.*$(.+$)*.*
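
(An aside on the "breaking occurrence": the traffic rc stops by
itself once no later entry matches.  If a hard limit is wanted as
well, a counter is one way to get it.  This is only a sketch and not
part of what was tested here; PASSES and the limit of 50 are made up
for illustration, and it assumes expr is on the PATH.  It would stand
in for the second recipe in "traffic":

#-------------------------------------------------------------
 # Count how many times this rc has been entered (hypothetical
 # variable; starts from 0 when unset).
 PASSES = `expr ${PASSES:-0} + 1`

 # Recurse only while PASSES is between 0 and 49.
 :0
 * PASSES ?? ^^[1-4]?[0-9]^^
 * $ B ?? ^$\MATCH(.*$)*\/\[ ()[0-9]:.*$(.+$)*(.*\<)?\
           $LOCATIONS\>.*$(.+$)*.*
 { SWITCHRC = $_ }
#-------------------------------------------------------------

The extra condition fails once PASSES reaches 50, so the rc cannot
call itself more than 49 times even if the body condition somehow
kept matching.)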

Okay, I ran that against the Friday and Monday 8:00 a.m. reports I
got.  There was only one thing of interest for your chosen cities,
and I wanted more for testing.  So I added Berkeley to the list.  Here 
is the complete contents of "somefile" after running both days' reports
through the above:

 12:30am [~/Mail] 1206[0]> cat somefile 
[ 7:50 AM]  050305   ACCIDENT
PALO ALTO : SB 101 BEFORE UNIVERSITY AV ... ACCIDENT THREE CAR FENDER
BENDER,
ON RIGHT SHOULDER  -- 411 -- (CHP)  #R

[ 7:39 AM]  101202   SLOWDOWNS
BERKELEY : WB 80 TRAFFIC IS SLOW FROM GILMAN ST TO 580/880 IN OAKLAND
...  --
(AIRBORNE)

[ 7:44 AM]  051202   SLOWDOWNS
RICHMOND  : WB 80 TRAFFIC IS SLOW FROM RICHMOND PKWY TO SAN PABLO DAM RD
, 
AND THEN AGAIN FROM CENTRAL AVNUE IN RICHMOND TO UNIVERSITY AVENUE IN
BERKELEY
...  -- (AIRBORNE)

------------------------------------------------------
Woo-hoo!  We got milk.  Make sure your LINEBUF is big enough.  You
might want to set it to 4K at least just for grins.
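
In rc terms that is a single assignment (the variable is spelled
LINEBUF in procmailrc(5)), placed before the recipes with the long
conditions.  The value below is just the 4K figure mentioned above,
not a tuned number:

#-------------------------------------------------------------
 LINEBUF = 4096
#-------------------------------------------------------------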

Hope that helps,
Dallman

-- 
Dallman Ross

"If you find a path with no obstacles, it probably does not lead to
anywhere."
        Thoughts of Rev. Sunnan Kubose, from _Zen in the Markets_ 



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail