procmail

Re: REGEX

2000-12-19 02:19:32

To save time, I'm replying to all of you at once.


D E Hammond <procmail1(_at_)tradersdata(_dot_)com> writes:
Procmail's regular expression engine doesn't support all the extended
syntax of egrep and perl. (At least not in my version.) You can see
the supported syntax in man procmailrc, search for "Extended regular
expressions".  If you will be happy with the procmail equivalent to what
you've given above, it would look like:

* ^Subject:.*[ ][ ][ ][ ][ ]

where the bracket pairs each enclose a space char. The trailing ".*" is
unnecessary, unless you need it for $MATCH. (I'm assuming the spaces
must be bracketed, or would trailing whitespace be significant in a
condition?  Even if so, the brackets are good for clarity when you're
looking at this recipe months from now.)

Trailing whitespace _is_ significant.  While bracketing the whitespace
does work to make it stand out, parentheses are preferred as they can be
processed more efficiently than the character classes created by
brackets.  I would probably write the above condition as either:

        * ^Subject:.*     ()
or:
        * ^Subject:.*(     )

If it were important to me that the spaces be easily counted, then I
would use either:
        * ^Subject:.*( )( )( )( )( )
or:
        * ^Subject:.*( ) ( ) ( )

Those will all be processed as efficiently as if the parens weren't there
at all, something not true of the bracket version.



But you might want to consider a couple more things. Are you positive
these are spaces, or might there be tabs? If so, you would want
something like:

* ^Subject:.*( |       )( |    )( |    )( |    )( |    )

where there is a space and a tab in each alternation.

Arg!  A character class is much more efficient than an alternation,
such that the above would be better written as

        * ^Subject:.*[  ][      ][      ][      ][      ]


To summarize: parens are free, plain characters are fast, character
classes are okay, alternation is slow.
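
Literal tabs are easy to lose when a recipe gets pasted around, so here
is a minimal Python sketch that generates the condition line with the
right number of blanks.  The function name and its parameters are
hypothetical, made up for illustration; only the shape of the output
follows the conditions shown above.

```python
def whitespace_condition(header: str, count: int, allow_tab: bool = True) -> str:
    """Build a procmail condition matching `count` consecutive blanks in a header.

    Hypothetical helper for illustration; not part of procmail.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    # Use a space+tab character class when tabs may occur; otherwise use a
    # parenthesised plain space so the blanks stay visible and countable.
    unit = "[ \t]" if allow_tab else "( )"
    return "* ^" + header + ":.*" + unit * count


if __name__ == "__main__":
    # repr() shows each embedded tab as \t so it can be checked by eye.
    print(repr(whitespace_condition("Subject", 5)))
    print(whitespace_condition("Subject", 5, allow_tab=False))
```

The output still has to be written into the rcfile with real tab
characters, since procmail does not interpret "\t" inside a condition.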


Sergiy Zhuk <serge(_at_)yahoo-inc(_dot_)com> writes:
On Tue, 19 Dec 2000, D E Hammond wrote:

Procmail's regular expression engine doesn't support all the extended
syntax of egrep and perl. (At least not in my version.) You can see

it should support ":space:" though, but for some reason v3.14 that I had
didn't, so I had to put a workaround
"[^:alnum::punct::cntrl::digit::graph:]" which worked...


Procmail does not support named character classes.  This is partly
because they weren't in the original egrep, partly because POSIX, which
created them, failed to provide an interface for implementing them, and
partly because no one has gotten around to working on procmail's regexp
engine in a long time.

Besides, you don't even have the correct syntax: POSIX named character
classes require an extra set of brackets around the name, like this:

        [^[:alnum:][:punct:][:cntrl:][:digit:][:graph:]]

What you wrote will match any single character except newline or one of
        :alnumpctrdigh

So no, that still doesn't work.
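
To see where that character list comes from, here is a small Python
sketch that collects the distinct characters listed inside the bracket
expression as it was written.  The helper name is hypothetical, and it
only handles the plain "[^...]" form with no ranges or nested brackets.

```python
def bracket_members(expr: str) -> str:
    """Return the distinct characters listed in a simple negated bracket
    expression, in the order they first appear.

    Hypothetical helper: handles only "[^...]" with no ranges or nesting.
    """
    if not (expr.startswith("[^") and expr.endswith("]")):
        raise ValueError("expected a bracket expression of the form [^...]")
    body = expr[2:-1]
    seen = []
    for ch in body:
        # Without the inner brackets, every character is just a set member.
        if ch not in seen:
            seen.append(ch)
    return "".join(seen)


if __name__ == "__main__":
    broken = "[^:alnum::punct::cntrl::digit::graph:]"
    print(bracket_members(broken))
```

Run on the workaround quoted above, it yields the same fourteen
characters listed earlier.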


John Summerfield <summer(_at_)OS2(_dot_)ami(_dot_)com(_dot_)au> writes:

* ^Subject: {5}

might match a subject line with five spaces.

It might, but only if the line started with the literal characters
"Subject: {5}" and had five spaces later on it.  Braces are _not_
special in procmail's regexps.  If you want an exact count, you have to
write out the desired number of copies, and if you want a range, you
need to write out the maximum number and stick in some carefully placed
parens and question marks.  See below for an example.

Note that plain braces have *never* been special in egrep's regexps.
_Extended_ regexps and so-called _advanced_ regexps** use plain braces
to perform counts, but egrep does not use normal extended regexps.  GNU
egrep and grep can perform counted matches using escaped braces: \{ and
\}.  On the other hand, the POSIX-mandated '-E' option to grep and
egrep forces the use of standard extended regular expressions in which
case braces _are_ special.  Is this a mess?  Yes.  Did POSIX make it
better?  Maybe, but it sure didn't make it simpler.

** "Advanced regular expressions" is the name for perl's mongo regexp
syntax, including the "(?" extension method, now implemented in Tcl and
a handful of other packages.


Here's an example of a correctly parenthesised expansion of the
extended regexp "(foo){3,7}":

        foofoofoo(foo(foo(foo(foo)?)?)?)?

This is much more efficiently processed than the apparently equivalent
expressions

        foofoofoo(foo)?(foo)?(foo)?(foo)?
or
        foofoofoo((((foo)?foo)?foo)?foo)?

Why?  The first regexp can match a given number of occurrences only a
single way, and it never has to backtrack more than once to do so.  The
second has several ways of matching a middle number of copies (for
example, there are six ways it could match 5 copies of "foo") and it
gets worse by approximately the square of the difference between the
max and the min.  The third has only one way of matching, but almost
always has to backtrack several times.
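
The expansion and the ambiguity count can be sketched in Python as
follows.  The function names are hypothetical, and Python's re module is
used only to check that the expanded patterns accept the right number of
copies; it says nothing about how procmail's own engine behaves.  The
unit is inserted verbatim, so it should be a plain literal like "foo".

```python
import re
from math import comb


def expand_nested(unit: str, lo: int, hi: int) -> str:
    """Expand (unit){lo,hi} into lo plain copies plus nested optional groups."""
    if not 0 <= lo <= hi:
        raise ValueError("need 0 <= lo <= hi")
    tail = ""
    for _ in range(hi - lo):
        # Wrap what we have so far: each optional copy contains the next.
        tail = "(" + unit + tail + ")?"
    return unit * lo + tail


def expand_flat(unit: str, lo: int, hi: int) -> str:
    """Expand (unit){lo,hi} into lo plain copies plus side-by-side optionals."""
    if not 0 <= lo <= hi:
        raise ValueError("need 0 <= lo <= hi")
    return unit * lo + ("(" + unit + ")?") * (hi - lo)


def flat_ways(lo: int, hi: int, copies: int) -> int:
    """Count the ways the flat form can match exactly `copies` units:
    choose which of the (hi - lo) optional groups are used."""
    if not lo <= copies <= hi:
        return 0
    return comb(hi - lo, copies - lo)


if __name__ == "__main__":
    print(expand_nested("foo", 3, 7))
    print(expand_flat("foo", 3, 7))
    print(flat_ways(3, 7, 5))
    for n in range(10):
        # Both expansions should accept exactly 3 through 7 copies.
        nested_ok = re.fullmatch(expand_nested("foo", 3, 7), "foo" * n) is not None
        flat_ok = re.fullmatch(expand_flat("foo", 3, 7), "foo" * n) is not None
        assert nested_ok == flat_ok == (3 <= n <= 7)
```

The nested form has exactly one way to match any given count, which is
why flat_ways has no counterpart for it.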


If there's a competent C programmer out there interested in fixing
procmail's regexps, please send me email.  I have a pointer to a good
advanced regexp package, some ideas about backwards compatibility, and
some advice about getting started for anyone who's interested and has
the time to work on it.


Philip Guenther
Procmail Maintainer
_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
