Re: Most efficient recipe [was: syntax enhancement dreams]

areray(_at_)io(_dot_)com (Rocket Ray) writes:
...

> Which gets into something I've been wondering:  suppose one is on three
> mailing lists which procmail will sort to a special file.  My
> understanding of what's been said is that it's faster for procmail to do
> this internally, i.e.:
>
> :0:
> * (\
>    ^From:.*first_mailing_list|\
>    ^Sender:.*second_mailing_list|\
>    ^Resent-From:.*third_mailing_list\
>   )
> mailinglist_folder
>
> is faster than:
>
> :0:
> * ? egrep -s -f $HOME/.procmail/file_of_mailinglist_regexps
> mailinglist_folder
>
> Am I right that the first recipe would be faster than the second?  And

Yes.

> if so, at roughly what point would the second recipe become faster-- in
> other words, if one is on 50 mailing lists instead of three, would the
> first recipe still be faster?


For fifty I would still say procmail, but if you have a good egrep that
does lazy determination of the state machine, then I suspect it'll
eventually be faster.  I'd try timing it at that point.

Actually, let's do that.  I just timed the running of procmail
across the 2307 messages in my inbox (that includes several hundred deleted
messages, as I use MH and deleted messages are just renamed instead of
deleted) with the following two (summarized) procmailrcs:

------------------------
        :0
        * ^From:.*\<(\
        address|\
        address|\
        ...\
        )\>
        { FOO }
        HOST
------------------------
        :0
        * H ?? ? egrep -qf /tmp/f.grep
        { FOO }
        HOST
------------------------
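
For anyone who wants to repeat the measurement, a loop along these lines
would do.  This is only a sketch: the post does not say how procmail was
actually invoked, and the function name run_over_folder is made up for
illustration.  It feeds each file in a folder (one message per file, as
in MH) to whatever command you give it and reports the file count.

```shell
#!/bin/sh
# run_over_folder DIR COMMAND [ARGS...]
# Feed every regular file in DIR to COMMAND on stdin, one file per run,
# then print how many files were processed.
# (Hypothetical helper; not part of procmail or MH.)
run_over_folder() {
    dir=$1
    shift
    count=0
    for msg in "$dir"/*; do
        [ -f "$msg" ] || continue          # skip subdirectories
        "$@" < "$msg" > /dev/null || :     # exit status deliberately ignored
        count=$((count + 1))
    done
    echo "Files: $count"
}
```

In bash, something like

    time run_over_folder "$HOME/Mail/inbox" procmail -m /tmp/rc.internal

would then give user/system figures for one rcfile.  The use of
"procmail -m" (mail filter mode) and the rcfile name are assumptions;
use a test rcfile that delivers nothing, so you don't refile your inbox.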

/tmp/f.grep contained the same regexp as appears in the condition in
the first procmailrc.  Note that we even limit what we feed to egrep to
the header using the "H ??" bit, and the -q flag tells egrep (well, GNU
grep) to not even print the matching lines.  The grep used was GNU grep
2.0, which does do lazy-state determination, and since we don't use
back-references, it should be faster than most any other grep.
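
A pattern file like /tmp/f.grep could be generated roughly as follows.
This is a sketch, not the script actually used: the function names are
invented, and escaping the dots in the addresses is an assumption (the
post doesn't say whether that was done).  The rot13 helper mirrors the
rot-13 doubling of the address list described in the next paragraph.

```shell
#!/bin/sh
# rot13: read text on stdin, write its rot-13 transformation to stdout.
# Only ASCII letters change; '@', '.', '-' and digits pass through.
rot13() {
    tr 'A-Za-z' 'N-ZA-Mn-za-m'
}

# build_pattern: read one address per line on stdin and print a single
# egrep pattern of the form  ^From:.*\<(addr1|addr2|...)\>
# Dots are escaped so they match literally (an assumption, see above).
build_pattern() {
    sed 's/\./\\./g' |
        paste -s -d '|' - |
        sed 's/^/^From:.*\\<(/; s/$/)\\>/'
}
```

With the real addresses in a file (say, a hypothetical addrs.txt), the
doubled list would be built with

    { cat addrs.txt; rot13 < addrs.txt; } | build_pattern > /tmp/f.grep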

Oh yeah, the list of addresses searched for was generated by pulling 25
addresses from the Publicly Accessible Mailing Lists list, then
rot-13'ing them to get another 25.  The rot-13'ing may seem
unreasonable, but if you actually think about it you'll realize that
they look equally plausible to the grep.  You'll also notice that I'm
not actually subscribed to any of these lists (or their rot-13'ed
versions!), but that means we'll be exercising the worst-case scenario
(no match in the entire header).  Okay, enough caveats, here are the
numbers:

regexp inside the procmailrc:
Files:  2307
user:   114.41
system: 136.99
total:  251.40

regexp done via egrep:
Files:  2307
user:   1462.84
system: 312.35
total:  1775.19


A factor of seven is a lot in my book, so it looks like it'll take a
_really_ large number of lists to make the egrep faster.
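
The arithmetic behind that factor can be checked with a few lines of
shell.  The helper name ratio is made up; the figures are the ones
reported above, with the egrep run divided by the internal-regexp run.

```shell
#!/bin/sh
# ratio A B: print A divided by B, rounded to two decimal places.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# egrep run vs. regexp inside the procmailrc, figures copied from above.
echo "user:   $(ratio 1462.84 114.41)x"
echo "system: $(ratio 312.35 136.99)x"
echo "total:  $(ratio 1775.19 251.40)x"
```

That works out to about 7.06 on total time, made up of roughly 12.79
on user time and 2.28 on system time.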


Philip Guenther