procmail

Re: regexp doesn't match longest possible substring

1996-08-07 11:54:43
Mike Rose asked,

| The procmail internal egrep does not match the longest possible
| substring (left precedence).  Is the behavior of the pattern matcher
| specified, so it is always possible to determine what MATCH will be?
| If the behavior is unspecified, ...

It's described in one of the man pages: probably procmailrc(5).

| ... this is a problem when using \/ to fill MATCH.

Without the \/ operator, procmail stops at the shortest match, but in that
case it makes no practical difference, and Stephen mentions it only as a
programmer's note.

When \/ is extracting text, procmail assigns the least possible text to the
pattern left of \/ and the most possible, extending both leftward and
rightward, to the pattern right of it.

| Here's a simple example:
| ----------------------------------------
| # The " *" in the subject regexp match does not match the
| # longest possible regexp.  Send email with a subject line
| # "foo          bar".  The " *" does not absorb all spaces
| # after "foo", which leaves those spaces in MATCH.
|
| :0h
| * $ ^Subject: foo *\/.*
| |echo "match is \"$MATCH\"" >> bug.log
| ----------------------------------------

[That recipe needs an `i' flag, by the way, to shut procmail up when the
 action line doesn't read stdin, and the "$" modifier on the condition
 is unnecessary.]
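
With both of those changes applied, the recipe might look like the sketch
below. Only the flags and the "$" modifier differ from the quoted original;
the condition is untouched, so MATCH would still begin with the spaces
after "foo".

```procmail
# Same test recipe, with the `i' flag added so procmail ignores the
# write error from an action that never reads stdin, and without the
# unneeded "$" modifier on the condition.
:0 hi
* ^Subject: foo *\/.*
| echo "match is \"$MATCH\"" >> bug.log
```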

Since a null string matches " *" and a string that begins with a space
matches ".*", indeed, $MATCH will begin with the space after "foo".
This is different from the behavior of sed or grep, where the asterisk
on the left gets preference over the asterisk on the right.

If you want all spaces after "foo" to be kept out of $MATCH but also want
to allow for there being no spaces, you have to code it like this:

 * ^Subject: foo *\/[^ ].*

That will extract from the first non-space after "foo" to the end of the
line.
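
As a rough sketch, that condition could sit in a recipe that saves the
extracted text for later use. The variable name SUBJECT_REST is made up
for the illustration. For the subject "foo          bar" it should end up
holding "bar", and a subject of just "foo" would not match at all, because
[^ ] needs at least one non-space character.

```procmail
# Keep the spaces out of MATCH by making the right side start with a
# non-space, then copy MATCH before a later recipe overwrites it.
:0
* ^Subject: foo *\/[^ ].*
{
  SUBJECT_REST=$MATCH    # hypothetical variable name
}
```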

| In other regexp programs one can use parens to group and
| numbered variables to extract specific groups.

Procmail's egrep allows for grouping with parentheses but not for numbered
back-referencing.  (Back-referencing can be simulated but it takes some extra
code.)
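
One way the simulation might go, offered only as a sketch: extract the text
with \/, save MATCH in a variable, and interpolate that variable into a
later condition that carries the "$" modifier. The variable SENDER and the
mailbox name self-replies are invented for the example, and the $\SENDER
form (which disarms regexp metacharacters in the value) assumes a procmail
version whose procmailrc(5) documents it.

```procmail
# Step 1: capture an address from the From: line into MATCH.
:0
* ^From:.*\/[-a-z0-9._]+@[-a-z0-9.]+
{
  SENDER=$MATCH          # hypothetical variable name

  # Step 2: reuse the captured text, much as \1 would be used elsewhere.
  # The "$" modifier makes procmail expand variables in the condition.
  :0
  * $ ^Reply-To:.*$\SENDER
  self-replies           # hypothetical mailbox
}
```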
