Jim Dennis <jimd(_at_)starshine(_dot_)org> writes:
I think I'm finally getting the idea on these $MATCH
settings. I've seen the entry in the man pages and
just plain avoided using them (relying on my awk and
perl scripts to actually do the extraction for me).
Here's the excerpt from the man page:
MATCH This variable is assigned to by procmail whenever
it is told to extract text from a matching regular
expression. It will contain all text matching the
regular expression past the `\/' token.
What I didn't understand from reading this -- and was only
vaguely seeing in the many examples (procmailex and
here on the list) was that the procmail regex pattern
consists of two parts -- the condition pattern which
determines if the recipe is used and the optional
part that sets the $MATCH variable. The '\/' "fence"
token separates these.
Well, yes and no. For the condition to match and the MATCH variable
to be set, the entire condition, ignoring the \/ token, must match.
Many recipes that have a \/ token may match zero characters on the
righthand side due to use of the '*' qualifier. However, you can
just as well require characters on the righthand side, and if procmail
can't match them as it goes, the entire regexp fails to match. For
example, in your very next example one of "bar", "bare", "boar", or
"bear" *must* follow the "foo" (after any spaces) or the regexp will
fail to match.
So the condition:
* ^Subject:.*foo\/ *(bare?|b[oe]?ar) *
... should be met by any subject containing 'foo' and
set the $MATCH to " bar " or " bare " or " boar " or
" bear " (with any surrounding spaces).
If the subject doesn't match the regexp:
Subject:.*foo *(bare?|b[oe]?ar) *
then the condition will fail and MATCH will not be set. If it _does_
match, then MATCH will be set to have all the spaces between the "foo"
and the "b", one of bar, bare, boar, bear, then all the spaces
immediately following that. BTW: only one of the question marks in
that regexp is needed, as they both merely add "bar" to the set of
matched phrases. You also could have used " +" instead of " *".
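To check the point about the question marks, here is a small sketch using Python's re module. Python is only my choice for illustration, procmail itself is not involved, and the word list is made up:

```python
import re

# Hypothetical word list for illustration; only the first four should match.
WORDS = ["bar", "bare", "boar", "bear", "beare", "br"]

def accepted(pattern):
    """Return the words that the pattern matches in full."""
    rx = re.compile(pattern)
    return [w for w in WORDS if rx.fullmatch(w)]

both = accepted(r"bare?|b[oe]?ar")        # the alternation as posted
first_only = accepted(r"bare?|b[oe]ar")   # keep only the first question mark
second_only = accepted(r"bare|b[oe]?ar")  # keep only the second question mark

print(both)
print(both == first_only == second_only)
```

All three variants accept the same four words, so dropping either question mark (but not both) leaves the set unchanged.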
Am I right? If so -- I think this once again underscores
the need to rewrite the documentation a bit more verbosely.
That is different than the regexes used by most other
*ix utilities -- although strangely similar to the
old ed s/foo/bar -- as though you said /search/ for the
first regex and "substitute" $MATCH with the second regex.
THERE IS NO SUBSTITUTION GOING ON HERE. If you want to compare it to
something, compare it to the \( \) tokens in sed or perl which allow
you to capture text for later use. In perl-like syntax, the above
would have been:
$header =~ /^Subject:.*foo( *(bare?|b[oe]?ar) *)/m;
$MATCH = $1;
Actually, to fully emulate procmail you would have to use perl5 regexp
extensions and write that as:
$header =~ /^Subject:.*?foo( *(?:bare?|b[oe]?ar) *)/m;
$MATCH = $1;
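For anyone who would rather experiment outside perl, the same emulation can be sketched in Python, whose re module has the same lazy and non-capturing syntax. The function name and the sample headers are made up for illustration. Procmail also ignores case by default, which this sketch (like the perl above) does not do:

```python
import re

# Lazy ".*?" stands in for procmail's minimal matching on the lefthand side;
# the capture group stands in for everything to the right of "\/".
PATTERN = re.compile(r"^Subject:.*?foo( *(?:bare?|b[oe]?ar) *)", re.MULTILINE)

def emulate_match(header):
    """Return what MATCH would hold, or None if the condition fails."""
    m = PATTERN.search(header)
    return m.group(1) if m else None

# Hypothetical headers, not taken from any real message.
hit = "From: someone@example.invalid\nSubject: about foo  bear  hunting\n"
miss = "From: someone@example.invalid\nSubject: foo fighters\n"

print(repr(emulate_match(hit)))
print(repr(emulate_match(miss)))
```

The first header yields the word plus its surrounding spaces; the second fails the whole condition, so nothing is captured.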
This brings me to the last tricky point with procmail regexps: unlike
99% of the regexp engines out there, procmail does *MINIMAL* matching
on the lefthand side of a \/ token, or if there is no \/ token. Most
regexp engines, as they attempt to match a regexp, if they come to a *,
+, or ? qualifier, will attempt to take the greatest number of
iterations then and there, doing fewer only if the later parts of the
regexp are unable to match. For example, the regexp
foo .* (bar.*blip|baz)
when matched against:
foo ---- bar ---- baz ---- blip ----
will match the section
foo ---- bar ---- baz
even though it can also match the longer section
foo ---- bar ---- baz ---- blip
This is because regexps have no foresight. When they're greedy, they
take as much as they can as soon as they can. The first ".*" in this
example will first eat up the entire line, then the engine will back up
until it can find a space (the .* must be followed by a space says the
regexp), then it'll back up until it can match the tail part of the
regexp. The first place it can do that when backing up is when it
backs up to "baz", so that is the choice it takes in the alternation.
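The greedy behaviour is easy to reproduce with any backtracking engine. A minimal sketch using Python's re module (my choice for illustration, with the regexp and input taken from the example above):

```python
import re

text = "foo ---- bar ---- baz ---- blip ----"

# Greedy quantifiers: the first ".*" grabs the whole line and backs up
# only as far as it must for the rest of the regexp to match.
greedy = re.search(r"foo .* (bar.*blip|baz)", text)

print(greedy.group(0))  # the whole matched section
print(greedy.group(1))  # which branch of the alternation was taken
```

The match stops at "baz" because that is the first point, working backwards from the end of the line, where the tail of the regexp can match.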
Procmail is different: when it encounters a *, +, or ? qualifier, it
first tries to match as few times as possible (0, 1, or 0 times
respectively), and only matches more if it needs to in order to match
later parts of the regexp. Given the same regexp and input as above,
it'll match
foo ---- bar ---- baz ---- blip
because that is the first match that it encounters in its search.
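Minimal matching can be approximated in Python's re module by marking every quantifier lazy with a trailing "?". This is only a sketch under that assumption: procmail's engine is not Python's, but on this regexp and input the result is the one described:

```python
import re

text = "foo ---- bar ---- baz ---- blip ----"

# Lazy quantifiers: each ".*?" starts at zero characters and grows only
# when the rest of the regexp cannot match otherwise.
minimal = re.search(r"foo .*? (bar.*?blip|baz)", text)

print(minimal.group(0))  # the whole matched section
print(minimal.group(1))  # which branch of the alternation was taken
```

Here the first ".*?" stops as soon as it reaches " bar", so the "bar.*blip" branch is tried first and the match runs through to "blip".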
I'll note here that for simple boolean tests, minimal and maximal
matching give the same result. Either there is or there isn't a match,
and either method will find it. It just so happens that minimal
matching is 'usually' faster for the sorts of regexps that procmail
gets. Where it makes a difference is when something like procmail's
\/ token or perl's ()'s allow you to see what the engine matched.
In order for the \/ token to be practically useful, procmail turns around
and does _maximal_ matching on the righthand side, while still doing
minimal matching on the left. This usually does what you want, but there
are exceptions. To use a real example from this mailing list:
^Subject: +get +file +\/[^ ]*
The intent here is to take a subject like the following, which has two
spaces after "file":
Subject: get file  picture.ps
and match the name of the file to be retrieved, putting "picture.ps"
into MATCH. Unfortunately, when procmail goes to match this, it first
tries to match only one space after "file", that being the minimum. It
then tries to match the rest of the regexp "[^ ]*" against the
remainder of the subject, namely " picture.ps" (notice the leading
space). It *succeeds*, matching *zero* times, and leaves MATCH empty.
The solution is to force procmail to match at least one character on
the righthand side of the \/ token by changing the star to a plus:
^Subject: +get +file +\/[^ ]+
Now, when procmail tries to match only one space after "file" with the
" +" it can't match the rest of the regexp, so it has to back up and
match another space. This time it succeeds, and MATCH will contain
"picture.ps". The moral of this whole section is to warn new users
that when using the \/ token:
FORCE THE RIGHTHAND SIDE TO MATCH AT LEAST ONE CHARACTER
This can often be done by changing a '*' right after the \/ to a '+'.
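The pitfall and the fix can both be reproduced with a rough Python emulation: lazy quantifiers on the lefthand side, and a greedy capture group in place of the righthand side. The helper name is made up, and the subject is assumed to have two spaces after "file", since the problem only shows up when there is more than one:

```python
import re

# Assumed input: two spaces after "file", so one space is left over
# once the lazy " +?" has taken its minimum of one.
SUBJECT = "Subject: get file  picture.ps"

def emulate(righthand):
    """Lazy quantifiers on the left, a greedy capture group for the right."""
    m = re.search(r"^Subject: +?get +?file +?(" + righthand + ")", SUBJECT)
    return m.group(1) if m else None

star = emulate(r"[^ ]*")  # righthand side may match zero characters
plus = emulate(r"[^ ]+")  # righthand side must match at least one

print(repr(star))
print(repr(plus))
```

With the star the capture comes back empty even though the condition matched; with the plus the lefthand " +" is forced to take the second space, and the capture holds the filename.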
If the above interests you but is too short or complex, consider
reading a book like "Compilers: Principles, Techniques, and Tools" (the
so-called "Dragon book") as it explains not only why the above takes place, but
shows you how to write your own regexp engine.
If someone converts the above to HTML and puts it in a FAQ like setting,
can you send me the URL so I can just quote that next time this pops up?