procmail

Re: what regexps work?

1997-04-11 23:36:00
Eli the Bearded <process@qz.little-neck.ny.us> wrote:
> James L. McGill <fishbowl@fotd.netcomi.com> responded to my
>> I'd really like to see, for version 4 for example,
>> true backreferences, instead of the uninteresting $MATCH  \/
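(For anyone not following the reference: the \/ token and $MATCH work roughly
like the minimal sketch below.  The header, the SUBJECT variable and the
one-space character class are only illustrative; normally you would also put
a tab inside the brackets.)

    # grab everything after "Subject:" into $MATCH via the \/ token
    :0
    * ^Subject:[ ]*\/.*
    {
      SUBJECT = "$MATCH"
    }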

> In a strict mathematical sense, backreferences are not a part
> of regular languages and hence regular expressions. They also
> force the use of an NFA engine, which is less efficient for
> large blocks of text than a DFA. DFAs require more complex
> regexp compiling; I don't know at what quantity of text the
> faster processing offsets that. If one assumes that headers
> will be most of the text scanned, then one can assume small
> quantities of text. I have not RTFSed procmail, so I don't
> know what it uses.

Procmail uses a parallel parsing NFA that never looks back and
never looks ahead.  This allows for some interesting characteristics
(a small sketch of the idea follows the list below):
- "Simple" compilation (since it's parallel parsing, it's obviously not
  as simple as a normal NFA).
- A well defined upper bound on the maximum needed "stack" space (limited by
  the number of nodes in the NFA).
- The maximum length of the match in the text is truly unlimited.
- There is a well defined upper bound on the time needed to scan
  for a certain regexp in a certain text.  I.e. the time needed is bounded
  by a ceiling of:
    number_of_nodes_in_the_NFA * time_needed_to_process_one_node
      * length_of_the_text_to_scan
  Though, for a typical regexp (even a complex one) the time needed to
  scan the entire text is closer to:
    2 * time_needed_to_process_one_node * length_of_the_text_to_scan
- Note that the time needed to scan for a regexp increases *less* than
  linearly with the length of the regexp, regardless of its complexity.
  Normal NFAs tend to have a scan time that increases *more* than
  linearly with the length of the regexp, and it can even factor in
  the length of the text more than once.
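
As for what "parallel parsing" means here, a minimal sketch in Perl, assuming
a hand-written transition table for the toy (anchored) regexp ab*c.  The state
numbers, the table layout and the nfa_match name are invented for this
illustration and are not taken from procmail's source:

    #!/usr/bin/perl
    use strict;
    # Keep the set of currently active NFA states and advance all of
    # them once per input character, never backtracking.  The work per
    # character is bounded by the number of NFA states, which is where
    # the ceiling quoted above comes from.
    # NFA for ab*c:  0 -a-> 1,  1 -b-> 1,  1 -c-> 2 (accepting)
    my %delta  = ( '0 a' => [1], '1 b' => [1], '1 c' => [2] );
    my %accept = ( 2 => 1 );

    sub nfa_match {
        my ($text) = @_;
        my %active = ( 0 => 1 );              # start with state 0 active
        foreach my $ch (split //, $text) {
            my %next;
            foreach my $s (keys %active) {    # advance every active state
                foreach (@{ $delta{"$s $ch"} || [] }) { $next{$_} = 1 }
            }
            %active = %next;
            return 0 unless %active;          # all states died: no match
        }
        foreach (keys %active) { return 1 if $accept{$_} }
        return 0;
    }

    print nfa_match("abbbc") ? "match\n" : "no match\n";   # match
    print nfa_match("abx")   ? "match\n" : "no match\n";   # no match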

The downside is that this parallel-NFA does not allow (an easy implementation
of):
- Zero-width magic tokens.
- Backreferences (a real killer for memory efficiency; a short
  illustration follows below).
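
To make the backreference point concrete (a Perl illustration, not procmail
syntax, and an invented example): \1 has to repeat exactly whatever the first
group captured, so the matcher must remember arbitrarily long captured text,
which a fixed, finite set of NFA states cannot encode.

    #!/usr/bin/perl
    use strict;
    # \1 repeats whatever (\w+) captured, e.g. to spot doubled words.
    foreach my $line ("hello hello", "hello world") {
        if ($line =~ /^(\w+) \1$/) {
            print "doubled word:  $line\n";
        } else {
            print "no repetition: $line\n";
        }
    }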

I think, in a sense, procmail's parallel-NFA parser is sort of a hybrid
between an NFA and a DFA that is being computed at runtime (on a
need-to-know basis).  But please take this last assertion with a grain of
salt, since I studied physics and never did anything with FA theory (I just
programmed the engine; maybe that's why it turned out a bit unconventional :-).
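
One way to picture that hybrid reading: extend the sketch above by caching
the move of a whole set of active states on a character.  Each cached set
then behaves like one DFA state that is built only when it is first needed.
This is only my illustration of the idea (it reuses the %delta table from
the sketch above, in the same file) and says nothing about how procmail's
code is actually organized.

    # hypothetical add-on to the earlier sketch (same file, same %delta):
    # memoize the move of an entire active-state set on one character.
    my %dfa_cache;
    sub step_set {
        my ($setkey, $ch) = @_;              # e.g. $setkey = "0" or "1,2"
        my $slot = "$setkey/$ch";
        unless (defined $dfa_cache{$slot}) {
            my %next;
            foreach my $s (split /,/, $setkey) {
                foreach (@{ $delta{"$s $ch"} || [] }) { $next{$_} = 1 }
            }
            $dfa_cache{$slot} = join ',', sort keys %next;
        }
        return $dfa_cache{$slot};            # the next "DFA state" (a set)
    }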

>> construction.  I would not expect a full perl implementation,
>> but why couldn't procmail actually use egrep or something?

> Forking a separate process is much slower. Grabbing regexp

Actually, early procmail versions did exactly this.  But as soon as procmail
incorporated the full POSIX egrep syntax, that solution was ditched.

> libraries from Gnu egrep on the other hand....

Is much slower as well.

> I can't get this to work, though:
>
>       :0:
>       * ? perl -e ' \
>       $val=0; \
>       while (<STDIN>) { \
>        if ( /^$/ ){ $val && exit 0; exit 1; } \
>        $val += /^From:.*[ <]([^\@]+)\@(\1|[^.]{13,})\.(com|net)([ ]|$)/i; \
>       } '
>       $JUNK
>
> Somewhere along the way something seems to be doing variable
> interpolation on it.

Try running with VERBOSE=yes.  You should see what arguments procmail
is passing to perl.  I don't see any apparent reason why the above should
fail, though.
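
A minimal way to do that, assuming the recipe sits in your .procmailrc and
$JUNK is already set, is to bracket just this recipe (the perl condition is
abbreviated to '...' below); the extra diagnostics end up wherever your
procmail log normally goes:

    VERBOSE=yes
    :0:
    * ? perl -e '...'
    $JUNK
    VERBOSE=no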
-- 
Sincerely,                                                         srb@cuci.nl
           Stephen R. van den Berg (AKA BuGless).

A sign seen at the local pizza place: "DO NOT CARRY TAKE-OUT BOXES BY HANDLES"
