procmail

Re: invalid regexp

1996-12-09 15:56:51
  > Ok, I am on some lists which send their mail directly to each user,
  > rather than having some useful "To" line which I could match.  I also
  > want to not match majordomo stuff
  > 
  > If any of these are true, then don't match
  > 
  > * ! ^(Sender: nextppp(_at_)chinx1(_dot_)ThoughtPort(_dot_)COM|\
  >     From: Majordomo|\
  >     From Majordomo|\
  >     X-Mailing-List:|\
  >     BestServHost: lists.best.com|\
  >     X-Loop: TopTen(_at_)pobox(_dot_)com)
  > 
  > Would it be better to do this:
  > 
  > * ! ^Sender: nextppp(_at_)chinx1(_dot_)ThoughtPost(_dot_)COM
  > * ! ^(From|From:) Majordomo
  > * ! ^X-Mailing-List:
  > * ! ^BestServHost: lists.best.com
  > * ! ^X-Loop: TopTen(_at_)pobox(_dot_)com

I prefer the first way, and I would write it this way (I like using
indentation to make my regexps more readable):

   * ! ^(Sender: nextppp(_at_)chinx1\(_dot_)ThoughtPort\(_dot_)COM|\
         BestServHost: lists\.best\.com|\
         X-Loop: TopTen(_at_)pobox\(_dot_)com|\
         From(: *(.*[^a-zA-Z0-9_.-])?| )(Majordomo|ListServ|SmartList|procmail)|\
         X-Mailing-List:)

The expression ".*[^a-zA-Z0-9_.-]" matches an arbitrary string up to and
including a character which cannot be part of an email address.  This
allows the regexp to span and match an address like this:

    From: "Majordomo List Manager" majordomo
 
Don't forget that "." is a "wildcard" character, matching any
single character except newline, so a literal dot in a host name
should be escaped as "\.".
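
To see what the combined expression accepts, here is a rough sketch in
Python.  It is only an approximation: Python's "re" module is not
procmail's regexp engine, and the IGNORECASE and MULTILINE flags are my
assumption for imitating procmail's case-insensitive matching and its
"^" anchoring at the start of each header line.  The Sender and X-Loop
alternatives are left out, and the name is_list_mail and the
example.com addresses are made up for the illustration.

```python
import re

# Python approximation of the combined condition (not procmail's own engine).
# IGNORECASE and MULTILINE stand in for procmail's default behaviour.
LIST_HEADER = re.compile(
    r"^(BestServHost: lists\.best\.com|"
    r"From(: *(.*[^a-zA-Z0-9_.-])?| )(Majordomo|ListServ|SmartList|procmail)|"
    r"X-Mailing-List:)",
    re.IGNORECASE | re.MULTILINE,
)


def is_list_mail(headers: str) -> bool:
    """Return True if any header line matches one of the alternatives."""
    return LIST_HEADER.search(headers) is not None


if __name__ == "__main__":
    samples = [
        'From: "Majordomo List Manager" majordomo',
        "From: majordomo@example.com",
        "From: notmajordomo@example.com",
        "BestServHost: listsXbest.com",
    ]
    for line in samples:
        print(is_list_mail(line), line)
```

The third sample is rejected because "majordomo" is preceded by a
character that can be part of an address, and the fourth because the
dots are escaped and so no longer act as wildcards.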

The essence of your question is whether or not a single, long regexp is
faster (better) than several, short regexps.

The procmail regexps are first parsed and compiled into a byte-code
state machine, which is then run against the input byte stream.  Since
each regexp needs its own parse and its own state-machine generation,
it makes sense that many conditions, each with its own regexp, incur
the additional overhead of setting up the parse and compiling once per
condition.

One might intuitively think that there is also an overhead to parsing,
compiling, and interpreting a longer regexp, and that the success/fail
condition is less easily determined, but this is actually not true.

The advantage of a state-machine is that, as the input stream bytes flow
by, different branches are taken depending upon the character being
examined at the moment: in other words, each new character of input
causes a different branch (state) to be taken in the state-machine.  If
the state-machine runs out of states without successfully matching
before the input is consumed, then the match fails.
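
The idea can be sketched with a toy state machine in Python.  This is
not procmail's implementation, only an illustration under simplifying
assumptions: the alternatives are plain literal strings (no wildcards
or repetition), matching is anchored at the start of the line, and the
names build_machine and run_machine are invented here.  It shows that
each input character is examined once, no matter how many alternatives
were compiled into the machine.

```python
def build_machine(alternatives):
    """Compile literal alternatives into a table-driven state machine.

    State 0 is the start state; transitions[state][char] is the next state.
    """
    transitions = [{}]
    accepting = set()
    for word in alternatives:
        state = 0
        for ch in word.lower():
            if ch not in transitions[state]:
                transitions.append({})
                transitions[state][ch] = len(transitions) - 1
            state = transitions[state][ch]
        accepting.add(state)
    return transitions, accepting


def run_machine(machine, line):
    """Return (matched, steps), where steps counts the characters examined."""
    transitions, accepting = machine
    state = 0
    steps = 0
    for ch in line.lower():
        if state in accepting:
            return True, steps       # an alternative matched as a prefix
        steps += 1
        nxt = transitions[state].get(ch)
        if nxt is None:
            return False, steps      # ran out of states: the match fails
        state = nxt
    return state in accepting, steps


if __name__ == "__main__":
    one = build_machine(["X-Mailing-List:"])
    many = build_machine(["X-Mailing-List:", "X-Loop:",
                          "From: Majordomo", "From Majordomo"])
    for line in ["Subject: hi", "X-Loop: something", "From: alice"]:
        print(line, run_machine(one, line), run_machine(many, line))
```

A line such as "Subject: hi" is rejected after a single character by
either machine, so adding alternatives to the one regexp does not add
passes over the input.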

The only downside to a long regular expression is that it is more
complex for the human to maintain. 

If you wish to see some really long regular expressions, check out the
files "rc.submit" and "rc.request" in the SmartList package (a mailing
list package based on procmail, by the author of procmail).
