Re: Efficient use of OR-matches and $MATCH

posting-list(_at_)MailAndNews(_dot_)com (Jari Aalto+mail.emacs) writes:

From my experience with perl and Emacs REGEXP engine, the simple regexp
matches are very efficient and fast. If you add grouping modifiers 
like (xxxx)? and (xxxx)* (xxx)+ then it might get tad slow easily.

In your case, if I understand, you only have simple OR matches.


Actually, they're pretty expensive if there's nothing "before" them.
When there's more than one 'atom'** in the regexp that can be the first
matched, procmail's regexp processing time will typically be doubled
or tripled.  This is true of NDFA regexp engines in general, not just
procmail.  The reason is that each alternative requires at least one
'epsilon' node to be processed for each character.  For example, the
regexp
        ^(To|Cc):.*blah
is *much* faster than
        (^To|^Cc):.*blah

in the former, the '^' must be matched before anything else.  In the
later, procmail must try each alternative for a match.  They may be the
same, but procmail's regexp engine doesn't know that.  Yes, the engine
could be taught to recognize and optimize such, but it's not easy and
I haven't had the time to even think about doing it.

This entire subject is pretty complicated.  I strongly urge anyone
interested in improving their regular expression writing skills to
read the book "Mastering Regular Expression" (I think that's the name)
published by O'Reilly and Associates.  It explains all of the above and
*much* more in clearer language than mine.

This is as fast as it gets and the regexp size does not really matter.
They are constant string and if procmail uses DFA engine, the first one wins
and the regexp scanning is stopped.


Procmail uses an NDFA engine instead of a DFA one.  However, even with
DFAs a leading alternation is expensive.  How expensive depends on the
implementation and the internal representation of the compiled regexp.

So, I would say that there is nothing to optimize here, not atleaset with 
separate recipes. It might give you feew microseconds in some cases, but
it's not noticeable enough to make is justifiable to use several
no-so-maintenable recipes.


Left-factoring will almost certainly improve the speed of a regexp,
even if it's just a single character.  For an example of left-factoring,
consider the expansion of the ^FROM_DAEMON token.  Notice how instead
of saying:
        mmdf|majordomo
it says:
        m(mdf|ajordomo)

and instead of:

        request|response|root
it says:
        r(e(quest|sponse)|oot)

In truth, it could be sped up even more by even more aggresive
left-factoring, but the cost in size, readability, and maintainability
outweighs that in my mind.

Philip -- Any change to get a TIMING_ON and TIMING_OFF constants to progrmail
and TIMING_VALUE value containing the wall clock time (microseconds?) procmail

 spent

during region under test? Currently there is no good way to find how long
it takes to process a recipe.


Sure there is: create an rcfile with 1000 copies of that recipe, then run
procmail under the time command a dozen times with a 'typical' message and
that rcfile.  Average and divide by 1000.  You need to run a timing test
several times anyway to get any reliablity in the answer, so moving the
timing to outside procmail doesn't complicate the situtation *that* much.
(Followups on this topic should be directed to the procmail-dev list...)


Philip Guenther


** 'atoms' are the most basic unit from which regexps are constructed.
By themselves, they always match exactly one character in the input.
In procmail all of the following are atoms:
        a literal character
        .
        a character class (e.g., "[a-z]")
        \< and \> (these are just special character classes as far as
                procmail is concerned)
        ^ and $
        ^^
        any special character escaped with a backslash (but *not* "\/")

Regular expressions are then built from atoms using concatenation,
alternation, grouping, and repetition.