Re: Outmatched by MATCH

Erik Christiansen <erik(_at_)dd(_dot_)nec(_dot_)com(_dot_)au> writes:

      The second rule below grabs spaces, even though they are excluded, by
omission from the match list:

OR=2147483647^0                     # Max score => immediate success. i.e. OR

  :0:                                    # (Topic|Subject) ~ project
  * $ $OR ^(X-)?Topic: +\/[^ ]+
                                         # Allow also "Re: Fw: xxx"
  * $ $OR ^Subject:[  ]*((re|fw):?[   ]*)+\/[0-9a-zA-Z_-]+

  Given that "It will  contain  all text  matching the regular expression
past the `\/' token", how does MATCH then come to hold "Fw: n2ip_nms",
as logged:

...

      How can it do that? (Or, perhaps more to the point, what am I
      doing in an unprocmailian way? :-(


There are two things going on here: one is a bug in your condition,
the other is a bug in procmail.

The bug in your condition is that it requires maximal matching to the left
of the \/ token, something procmail doesn't do.  The result is that the
first '+'ed subregexp in your condition will match exactly once, because
"Re" and "Fw" are both valid matches for what comes after the \/ token.



The bug in procmail then comes into play.  Sometimes, when there is a '+'
operator to the left of the \/ token that, by procmail's "stingy on the
left" rules, should never match its subregexp more than once, procmail's
extraction code will get confused and merge the "start of match position"
from one tentative match with that of another tentative match that had that
'+' operator matching a different number of times.

To put it another way, the code makes what appears to be a valid
assumption that isn't really valid.  Fixing it will be a real pain,
which is why I haven't done so yet.  Fortunately, regexps that run into
the bug are relatively uncommon---this is only the second case I've heard
of---and there appears to always be an equivalent, simpler regexp that
doesn't trigger the bug.

In this case, the simpler regexp is

        ^Subject:[      ]*(re|fw):?[    ]*\/[0-9a-zA-Z_-]+

I.e., don't bother trying to match more than one 're' or 'fw' because
procmail's "stingy on the left" rules will never do so.
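
To see why the extra '+' buys nothing, here is a rough Python sketch.  It
is not procmail's matcher: it assumes that lazy quantifiers to the left of
the capture group are a fair stand-in for "stingy on the left", with the
capture group playing the role of the text after the \/ token.  The sample
Subject lines and the names ORIGINAL, SIMPLER, and match_var are made up
for illustration.

```python
import re

# procmail matches case-insensitively by default, so do the same here.
FLAGS = re.IGNORECASE

# Assumption: lazy quantifiers (*?, ??, +?) left of the capture group
# approximate procmail's "stingy on the left" behaviour.  The capture
# group itself stays greedy, like the text right of \/.
ORIGINAL = re.compile(
    r"^Subject:[ \t]*?(?:(?:re|fw):??[ \t]*?)+?([0-9a-zA-Z_-]+)", FLAGS)
SIMPLER = re.compile(
    r"^Subject:[ \t]*?(?:re|fw):??[ \t]*?([0-9a-zA-Z_-]+)", FLAGS)


def match_var(pattern, header):
    """Return what this emulation would leave in MATCH, or None."""
    found = pattern.search(header)
    return found.group(1) if found else None


# Hypothetical sample headers.
for header in ("Subject: Re: Fw: n2ip_nms",
               "Subject: Fw: n2ip_nms",
               "Subject: n2ip_nms"):
    print(repr(header),
          "original:", match_var(ORIGINAL, header),
          "simpler:", match_var(SIMPLER, header))
```

Under that assumption the two patterns give the same result for every
sample header, and for the doubled "Re: Fw:" case both stop at "Fw".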

Of course, _that_ brings us back to the bug in the condition: procmail's
interpretation of the regexp doesn't match your intent.  Hmm, there's
another bug in your condition: it doesn't require the 're' or 'fw' to
be separated from the matched text by anything.  Consider what happens
when the message header contains the field
        Subject: recall

Do you really want it to match "call"?

Finally, here's one other "what do you really want?" to ponder: should
the condition match any of the following fields?  If so, what should
it extract?
        Subject: re: re
        Subject: re: re:
        Subject: re: re  $)*(@#$
        Subject: re: re: $)*(@#$


Okay, so how would _I_ extract the first word after any 're's or 'fw's?
First off, let's fix the "recall" bug by requiring the 're' or 'fw' to
be followed by either a colon, space, or tab, _then_ zero or more spaces
or tabs.  Then, we'll start by extracting everything from the first 're'
or 'fw' to the word that we really want.  That puts the target word at
the end of MATCH, so we can just extract that off with a regexp anchored
at the end:

        # Whitespace inside [ ] below is a space and a tab (the tab may
        # display as several spaces).
        :0
        * ^Subject:[    ]*\/((re|fw)[:   ][     ]*)+[-_0-9a-z]+
        * MATCH ?? ()\/[-_0-9a-z]+$
        {
                ...
        }

I'm sure you can figure out how to make that work in your 'OR' case...
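
The two-step extraction can be tried out in Python as a sanity check.
This is only a sketch: Python's re module stands in for procmail's matcher
and is assumed to agree with it on these inputs, the capture group in
STAGE1 plays the role of the text right of \/, and the names STAGE1,
STAGE2, and extract_word are hypothetical.  The "Re: Fw: n2ip_nms" header
is a guess at the original Subject based on the logged MATCH; the others
are the fields quoted above.

```python
import re

# Step 1: grab everything from the first 're'/'fw' through the target word.
# Each 're' or 'fw' must be followed by a colon, space, or tab.
STAGE1 = re.compile(
    r"^Subject:[ \t]*(((re|fw)[: \t][ \t]*)+[-_0-9a-z]+)", re.IGNORECASE)

# Step 2: pull the trailing word off the end of the step 1 result.
STAGE2 = re.compile(r"[-_0-9a-z]+$", re.IGNORECASE)


def extract_word(header):
    """Return the first word after any 're'/'fw' prefixes, or None."""
    first = STAGE1.search(header)
    if first is None:
        return None
    second = STAGE2.search(first.group(1))
    return second.group(0) if second else None


for header in ("Subject: Re: Fw: n2ip_nms",
               "Subject: recall",
               "Subject: re: re",
               "Subject: re: re:",
               "Subject: re: re  $)*(@#$",
               "Subject: re: re: $)*(@#$"):
    print(repr(header), "->", extract_word(header))
```

In this emulation "recall" no longer matches, and each of the four
"re: re" fields yields a final "re".  Whether that is the desired answer
for those fields is still the open question.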


Philip Guenther
Procmail Maintainer