procmail

Re: Identify a .forward[ed] message

1999-05-09 00:37:13
"David W. Tamkin" <dattier(_at_)Mcs(_dot_)Net> writes:
...
:0E: # otherwise
# see if there are at least three Received: headers to weed out mail that
#  came straight from its source without forwarding
* ^Received:(.+$)*Received:(.+$)*Received:

That should of course have '+'s instead of '*'s:

        * ^Received:(.+$)+Received:(.+$)+Received:

You caught that yourself in later posts.

...
[Philip, is there a better way to write the first condition of the second
recipe?  You completely lost me with your post about that.]

With the correction, that's probably the best way to write it.  I
_think_ that in general, expressions involving the '+' operator are
close to optimal as is, so that only those involving '*', '?', and
(some) alternations bear examining for possible rearrangement.

Perhaps a quick explanation of how procmail implements regexps would
help.  Matching is done by considering each character of the text being
matched in turn and keeping track, as it goes, of which positions in
the regexp that character could match, given where the previous
character could have matched.

For example, take the following regexp:

        From:(.*$)+To:

and the following message header:

        Face-Info: really ugly
        From: foo
        Blah: bar
        To: baz

When procmail starts, nothing has matched so far, so the initial 'F'
in "Face-Info" can only match against the first character of the
regexp.  It does, so procmail makes note of that and goes to the next
character, the 'a'.  Since the previous character could match against
the first character of the regexp, procmail checks to see if this one
could match against the second.  It doesn't, so procmail knows that the
tentative matching of the 'F's was wrong and drops that 'branch' of
matching.  It next considers whether the 'a' could match the first
character of the regexp.  'a' != 'F', so it doesn't.  That eliminates
all the possible matches for the 'a', so procmail goes on to the next
character with a clean slate.  It (the 'c') doesn't match the 'F' from
the regexp, so procmail keeps going.  That keeps happening until
procmail hits the 'F' on the second line.  There it gets a match, so it
'pushes' a possible 'branch' of matching there.  When the 'r's match,
the branch is 'extended'.  Note that procmail will still be trying each
character for a match on the first character of the regexp.  When it
later hits the 'f' in "foo" (procmail ignores case by default) it'll
start a second branch (albeit a short-lived one).
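
To make the bookkeeping above concrete, here is a rough sketch in
Python of the same idea for a plain literal such as "From:".  It is
not procmail's code; the function name and the representation of a
'branch' (just a count of regexp characters matched so far) are my own
assumptions for illustration, and the pattern is assumed non-empty.

```python
def find_literal(pattern, text):
    """Return the offset just past the first match of `pattern`, or -1.

    Illustrative only: each 'branch' is the number of pattern
    characters it has matched so far.  Case is ignored throughout.
    """
    pattern = pattern.lower()
    branches = set()
    for i, ch in enumerate(text.lower()):
        # Every character gets a chance to start a fresh branch.
        branches.add(0)
        # Extend the branches whose next pattern character matches;
        # all the others are simply dropped.
        branches = {n + 1 for n in branches if pattern[n] == ch}
        if len(pattern) in branches:
            return i + 1
    return -1


header = "Face-Info: really ugly\nFrom: foo\nBlah: bar\nTo: baz\n"
end = find_literal("From:", header)
print(repr(header[:end]))
```

The 'F' of "Face-Info" starts a branch that dies on the 'a'; the 'F'
on the second line starts one that survives through the ':'.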

Note that branches can fork.  They will do so on the first character of
the target of a '*' or '?' operation, on the last character of the
target of a '+' operation, and on the initial characters of an
alternation.  For example, when procmail hits the ' ' in "From: foo" it
has to fork the branch and make two checks: can the ' ' match the '.'
in "(.*$)", and can the ' ' match the '$' in "(.*$)".  The first match
succeeds, so that branch is kept, while the latter doesn't match and is
dropped.  The same thing then happens for each character up to the
newline, where the opposite occurs.  Having finished the target of a
'+' operator, procmail then forks the branch for the 'B' in "Blah:".
One branch considers whether to make another loop through "(.*$)" while
the other considers whether to continue past the '+'.  The latter fails
because the 'B' doesn't match the 'T' in "To:".  The former succeeds,
of course, and after matching through the rest of the "Blah: bar" line
and the newline after it, procmail has hit the end of the target of the
'+' operator, so it forks again.  Once again, one branch considers the
loop while the other goes on.  This time, the second branch keeps
matching.  When the second branch reaches the end of the regexp,
procmail knows that the entire regexp has matched against the text, so
it returns success.  Yes, the other branch is still matching, but
that's fine as procmail is only looking for a match/no-match result
here (this is why we say procmail does "minimal matching").  If the
extraction token were involved, say, via the regexp

        From:\/(.*$)+To:

then procmail would make note of the match but keep matching other
branches, looking for a longer total match.  Otherwise, it would be the
same.
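
The forking described above can be sketched in Python as a set of
live branches stepped one character at a time.  This is only an
illustration under my own assumptions, not procmail's implementation:
the regexp From:(.*$)+To: is compiled by hand into a made-up
instruction list (PROG), and the names follow() and search() are
hypothetical.  With longest=False the search stops at the first
complete branch (match/no-match); with longest=True it keeps going and
reports the last offset at which any branch completed, which is a
simplification of what extraction needs.

```python
# Hand-compiled form of  From:(.*$)+To:  (instruction names are made up).
PROG = (
    [("char", c) for c in "from:"]   # 0-4: the literal "From:"
    + [("split", 6, 8),              # 5: '.*' forks: take '.' or go on to '$'
       ("any",),                     # 6: '.' matches anything but newline
       ("jmp", 5),                   # 7: back to the '.*' fork
       ("nl",),                      # 8: '$' matches the newline
       ("split", 5, 10)]             # 9: '+' forks: loop again or continue
    + [("char", c) for c in "to:"]   # 10-12: the literal "To:"
    + [("match",)]                   # 13: whole regexp matched
)
MATCH = len(PROG) - 1


def follow(pc, states):
    """Add pc to states, forking through split/jmp instructions."""
    if pc in states:
        return
    states.add(pc)
    op = PROG[pc]
    if op[0] == "split":
        follow(op[1], states)
        follow(op[2], states)
    elif op[0] == "jmp":
        follow(op[1], states)


def search(text, longest=False):
    """Return the end offset of a match, or -1 if there is none."""
    end = -1
    current = set()
    for i, ch in enumerate(text.lower()):
        follow(0, current)           # any character may start a branch
        nxt = set()
        for pc in current:
            op = PROG[pc]
            if ((op[0] == "char" and op[1] == ch)
                    or (op[0] == "any" and ch != "\n")
                    or (op[0] == "nl" and ch == "\n")):
                follow(pc + 1, nxt)  # branch survives and may fork
        if MATCH in nxt:
            end = i + 1
            if not longest:
                return end           # minimal: first success is enough
        current = nxt
    return end


text = "From: foo\nBlah: bar\nTo: baz\nTo: qux\n"
print(search(text), search(text, longest=True))
```

On this sample the minimal search stops just past the first "To:",
while the longest search runs on to the second one, because the branch
still looping through "(.*$)" is allowed to finish.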

There's quite a bit more involved in this, of course.  Procmail has to
fake a leading newline onto the text to handle regexps that start with
'^', for instance.  The above description should hopefully give you
enough of a picture of what's going on behind the scenes to help.


Philip Guenther
