procmail

Re: Regexp fails in scoring recipe

2003-05-15 16:35:55


Dallman Ross wrote:

Kevin Wu [mailto:tessar(_at_)bigfoot(_dot_)com] wrote:
So even NOT_AB = "(.|[^a].|a[^b])" is not good enough. The right answer is

NOT_AB = "(.?|[^a].|a[^b]|(.*)(${NWB}..|${WB}[^a].|${WB}a[^b]))"

where $NWB matches a character that is not a word boundary and $WB matches a character that is a word boundary.

That's a very good recital of the problem and a decent proposal,
imho.  However, I *still* think it would work just as well (if not
better) this way:

 NOT_AB = "(.?|[^a]|a[^b])"

and so on.  Although I thought I understood the idiosyncrasies of
what you stated by way of explanation (elided here), I confess I
still don't get the reasons for two chars after your $NWB above.

By way of example, we want to avoid matches on

   xyz ab

but we want to match each of the following

   xyz tab
   xyz crab
   xyz Schwab

The regexp "\<(.*)(${NWB}..)$" succeeds on all these examples. It's true that my definition of NOT_AB has an implicit assumption about the boundaries around it, and that's not a desirable characteristic. But it works, and that's a desirable thing. Here's a way to make the definition more compact via logical algebra:

NOT_AB = "(.?|(.*)${NWB}..|((.*)${WB})?([^a].|a[^b]))"
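For anyone following along, here is how that factoring step might look in an rc file. This is only a sketch: the thread never shows how $NWB and $WB are set, so the two character classes below are assumptions. Note that procmail expands ${NWB} and ${WB} at the moment NOT_AB is assigned, so the helpers have to be defined first.

```
# Assumed helper classes (not shown anywhere in this thread).
NWB = "[a-zA-Z0-9_]"
WB = "[^a-zA-Z0-9_]"

# Longer form from earlier in the thread:
#   .? | [^a]. | a[^b] | (.*)( ${NWB}.. | ${WB}[^a]. | ${WB}a[^b] )
#
# Step 1: pull ${WB} out of the last two prefixed alternatives:
#   .? | [^a]. | a[^b] | (.*)${NWB}.. | (.*)${WB}( [^a]. | a[^b] )
#
# Step 2: the bare "[^a]. | a[^b]" is the same tail with no prefix,
# so make the "(.*)${WB}" prefix optional:
NOT_AB = "(.?|(.*)${NWB}..|((.*)${WB})?([^a].|a[^b]))"
```

To use the variable inside a condition, the condition line needs the `$` flag (`* $ ...`) so that procmail expands it before compiling the regexp.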

In any event, my thought is that $NOT_AB
should stay a clean definition, and the regex can be built around
it to accommodate length of 0-infinity ${NWB} chars.

That would be great if it can be done.

In other words, the new recipe was matching road work events when the regexp was designed to match everything except road work events. To debug this, I used the \/ token to determine the matching text and put it into the log

Good trick (have done it myself).  :)


It works when a regexp matches something when you didn't expect it to match, but it doesn't work when the regexp fails to match when you expected it to match. The latter case applied to the original scoring recipe when it stopped working.
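A minimal sketch of the \/ logging trick, for readers who have not used it. The Subject condition is a made-up placeholder, not the scoring recipe from this thread, and it assumes LOGFILE is already set. Whatever the regexp matches to the right of \/ ends up in $MATCH.

```
# Hypothetical condition: put the regexp under test after the \/ token.
:0
* ^Subject:.*\/traffic.*
{
  # Assigning to LOG appends text to $LOGFILE. The closing quote is on
  # the next line so that the log entry ends with a newline.
  LOG = "MATCH=[$MATCH]
"
}
```

The brackets make stray leading or trailing characters in the match easier to spot in the log. The block only runs when the condition matches, which is why the trick is no help when a regexp silently fails to match.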

file. This is what I found:

A1: The traffic report body was in DOS format!

Okay, that's great that you found that; but why do the recipes
work for me with mail from the kpix traffic list?  I did test
it for a few days, after all.  And I also fired up vi (well,
vim) more than a few times on the traffic reports themselves,
and I never saw ^Ms!

I don't know, but it may be related to the way the MTAs are configured on our mail servers. Perhaps your MTA is stripping the CRs whereas mine is not. None of the mail originating from within my employer's firewall has CR in the message bodies, and only a handful of externally generated messages (other than these traffic reports that I get four times each weekday) have CR in the bodies.

The traffic reports that I get now do not have CR on some lines of the message body. Only the lines with traffic event content have CR. The separator lines between traffic events do not have CR. I suspect the content is generated by some external feed, and KPIX formats and wraps the traffic content with KPIX-specific content. The external feed may have switched from a Unix platform to a Windows platform.
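One possible workaround, sketched here under the assumption that tr(1) is available on the mail host, is to strip the CRs from the body before any scoring recipe sees it. As written this applies to every message; in practice it would probably sit behind a condition that selects only the traffic reports.

```
# f = treat the pipe as a filter, b = feed it the body only,
# w = wait for the filter to finish and check its exit code.
:0 fbw
| tr -d '\015'
```

Because only some lines carry a CR, deleting every CR in the body is simpler than trying to make each regexp tolerate an optional CR before the end of line.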

Kevin



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail