Matching "^" (Re: Adding Lines: header)

Nobody ever answered this:

On Wed, 4 Jun 2003, Holger Wahlen wrote:

> By the way, I was wondering whether it wouldn't do to use just
>
>   * 1^1 ^
>
> instead, but this always yields the maximum score (2147483647). Any
> ideas why?


I believe this happens because ^ matches either the start of the entire
buffer, or a newline between lines.  (Just as $ matches either the end of
the entire buffer, or a newline between lines.)  At the start/end of the
buffer, ^ and $ match a zero-width boundary, whereas for most embedded
newlines they match the actual newline character.  I say "most" because
when NOT scoring, a pattern with ^ at the beginning and $ at the end
matches two newlines, but when scoring it matches only one.

If scoring also matched both newlines, "* 1^1 ^.*$" would count only the
first, third, fifth, etc. lines, breaking the text up like so:

      (^first$)
        second(^
        third$)
        fourth(^
        fifth$)
        etc.
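
To make the diagram concrete, here is a rough sketch in Python.  It is only
a model, not procmail's matcher: Python's own ^ and $ are always zero-width,
so the newline-consuming anchors are spelled out by hand as (?:\A|\n) and
(?:\n|\Z).  The names BOTH_CONSUME and ONE_CONSUMES are made up for this
illustration.

```python
import re

text = "first\nsecond\nthird\nfourth\nfifth"

# Hypothetical model: "^" and "$" each swallow a newline (or sit at a buffer edge).
BOTH_CONSUME = re.compile(r"(?:\A|\n)(.*)(?:\n|\Z)")
# Hypothetical model: only "^" swallows the newline; "$" merely looks ahead.
ONE_CONSUMES = re.compile(r"(?:\A|\n)(.*)(?=\n|\Z)")

# findall scans left to right and never reuses characters already matched.
odd_lines = BOTH_CONSUME.findall(text)
all_lines = ONE_CONSUMES.findall(text)

print(odd_lines)  # ['first', 'third', 'fifth']
print(all_lines)  # ['first', 'second', 'third', 'fourth', 'fifth']
```

When both anchors consume a newline, the newline that ends one line is no
longer available to start the next, so every second line is skipped.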

Also, when scoring, procmail has to ignore the part that already matched
before it begins matching again.  So the unexpected behavior is that when
"^" matches the zero-width start of the buffer, the scan starts over at the
beginning of the buffer, because there is no "already matched part" to
ignore, and thus it matches again.  This repeats until the score hits the
maximum.
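
The loop described above can be sketched as a toy model in Python.  This is
an assumption about the mechanism, not procmail's source: the function name
score and its parameters are invented here, and Python's re treats ^ and $
as zero-width everywhere, so the sketch reproduces only the "no progress
after an empty match" effect.  MAX_SCORE is the value quoted in the question.

```python
import re

MAX_SCORE = 2147483647  # the ceiling quoted in the original question

def score(pattern, text, weight=1, cap=MAX_SCORE):
    """Toy model of '* weight^1 pattern': add weight per match, resuming after each."""
    regex = re.compile(pattern)
    total, pos = 0, 0
    while total < cap:
        m = regex.search(text, pos)
        if m is None:
            break                    # no further matches: the score is final
        total += weight
        if m.end() == pos:
            # Empty match exactly at the resume point: nothing was consumed,
            # so every later search would return this same match.  Jump to
            # the cap rather than looping two billion times.
            return cap
        pos = m.end()                # skip the part that already matched
    return min(total, cap)

print(score("l", "hello"))    # 2
print(score("^", "hello"))    # 2147483647
print(score(".*", "hello"))   # 2147483647
```

A pattern that always consumes at least one character moves pos forward on
every pass and so terminates; one that can match nothing eventually stalls
at a fixed position and runs up to the cap.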

In fact, anything that can match an empty span will score the maximum,
including:

 * 1^1 ()

 * 1^1 .*

 * 1^1 $

And the truly baffling:

 * 1^1 $^

 * 1^1 (.|$)

The latter is the reason that, to compute the size of a message, you
must either use the special case:

 * 1^1 > 1

or separately count non-newline characters and newlines:

 * 1^1 .
 * 1^1 ^.*$

(which actually counts one too many).
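
The arithmetic behind the two-condition version is simply that every
character is either a newline or not.  A minimal check in Python, using a
made-up sample message; it counts newline characters directly rather than
through ^.*$, so it does not reproduce the off-by-one noted above:

```python
import re

# Made-up sample message, purely for illustration.
msg = "Subject: test\n\nline one\nline two\n"

non_newlines = len(re.findall(r".", msg))  # "." never matches "\n" by default
newlines = msg.count("\n")

print(non_newlines + newlines == len(msg))  # True
```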


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail