procmail

Re: bug: overoptimizing RE

1998-01-21 14:44:15
Eli wrote,

| This was shown to me on procmail 3.11pre3, and I have duplicated it
| on 3.11pre7. Apparently in some cases REs are not "greedy" on the
| right side of a \/ capture.

That's not the case at all.  In Eli's example the right side was as greedy as
it could be; the problem is that he seemed to expect greed on the left as
well.  Let's examine it:

| ------ rc file ------
| VERBOSE=y
| :0:
| * ^Subject:.*Keywords.*\/[0-9]*
| /tmp/$MATCH
| ------ rc file ------
| 
| ------ test message ------
| From just@test
| Subject: Keywords 9999
| To: just
| 
| baodyu
| ------ test message ------

... and MATCH is set to null, contrary to Eli's expectation.

It is not a bug but rather a frequently misunderstood consequence of the way
extraction is documented to operate.  This has come up before, and Philip
Guenther has posted a long illustration.

Remember that only the right side is greedy; the left side is stingy, and
left-side stinginess takes precedence over right-side greed.

Extraction is implemented this way: the entire expression, left and right, is
pinned to the shortest possible match; then the division mark is placed and
the right side is repinned to the longest possible match starting at the
division.  The tricky part is to remember that the division is marked during
the stingy stage.

If the expression is

 ^Subject:.*Keywords.*\/[0-9]*

and the text is

 <newline>Subject:<space>Keywords<space>9999<newline>

then the shortest possible match to the entirety is

 <newline>Subject:<space>Keywords

because the trailing ".*" and "[0-9]*" both match null (the first ".*" only
has to cover the space before "Keywords").  Then the division mark is
placed on the space after "Keywords" and procmail looks for the longest
possible match to [0-9]* starting with that space.  That, again, is null,
so MATCH is set to null.

Eli noted that it works as he expected if the regexp is changed to this:

 ^Subject:.*Keywords.*\/[0-9]+

That is a whole other ball of wax.  Now the shortest match to the entirety is

 <newline>Subject:<space>Keywords<space>9

and the division mark is placed at the 9.  Then procmail refigures the
longest match to the right side starting at the division mark and sets
MATCH=9999.

Note that this would have given a non-null value to MATCH:

 ^Subject:.*Keywords\/.*[0-9]*

With the ".*" after "Keywords" moved to the right, it becomes greedy and
runs through the space and the digits to the end of the line, so
MATCH="<space>9999".
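
For anyone who wants to poke at the three expressions above without running
procmail, here is a rough model of the two stages in Python.  It is only a
sketch of the behaviour described in this message, not procmail's actual
code: the names extract and HEADER are made up for illustration, it uses
Python's regexp syntax (which happens to agree with procmail's for these
simple patterns), and it finds the stingy match by brute force.

```python
import re

# Hypothetical test header, taken from the test message quoted above.
HEADER = "Subject: Keywords 9999\nTo: just\n"

def extract(condition, text):
    """Model the extraction: return the would-be MATCH, or None on no match."""
    # Split the condition at the extraction marker \/ into left and right.
    left, right = condition.split(r"\/", 1)
    left_re = re.compile(left, re.MULTILINE)
    right_re = re.compile(right, re.MULTILINE)
    n = len(text)
    # Stingy stage: leftmost start, then the shortest overall match,
    # then the earliest place the division mark can go.
    for start in range(n + 1):
        for end in range(start, n + 1):
            for split in range(start, end + 1):
                if (left_re.fullmatch(text, start, split)
                        and right_re.fullmatch(text, split, end)):
                    # Greedy stage: longest right-side match that starts
                    # at the division mark fixed during the stingy stage.
                    for stop in range(n, split - 1, -1):
                        if right_re.fullmatch(text, split, stop):
                            return text[split:stop]
    return None

for cond in (r"^Subject:.*Keywords.*\/[0-9]*",
             r"^Subject:.*Keywords.*\/[0-9]+",
             r"^Subject:.*Keywords\/.*[0-9]*"):
    print(repr(cond), "->", repr(extract(cond, HEADER)))
```

Under those assumptions the three conditions come out as '', '9999' and
' 9999' respectively, which agrees with the walk-through above.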

Call it counterintuitive, but it's not a bug.  General advice: always make
sure that the right side cannot match null or that the last element of the
left side cannot match null.
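
The first half of that advice can be checked mechanically.  The sketch
below is a hypothetical helper (the name right_side_can_match_null is mine,
and it again leans on Python's regexp syntax standing in for procmail's,
which only holds for simple patterns); it reports whether the part after
the \/ is able to match the empty string.

```python
import re

def right_side_can_match_null(condition):
    """True if the part after the extraction marker matches the empty string."""
    # Everything after the first \/ is the right side of the extraction.
    _, right = condition.split(r"\/", 1)
    return re.fullmatch(right, "") is not None

for cond in (r"^Subject:.*Keywords.*\/[0-9]*",
             r"^Subject:.*Keywords.*\/[0-9]+"):
    print(repr(cond), right_side_can_match_null(cond))
```

A True result is only a warning sign, not proof of trouble: it does not look
at the left side, so an expression such as ^Subject:.*Keywords\/.*[0-9]* is
flagged even though its left side ends in a literal and it extracts fine.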
