procmail

Re: how does the grep work internally?

1997-06-23 15:21:00
Though I've deleted his entire text, era eriksson <era(_at_)iki(_dot_)fi> writes:
On Tue, 17 Jun 97 12:50 EDT, process(_at_)qz(_dot_)little-neck(_dot_)ny(_dot_)us (Eli the Bearded) wrote:
> not something easily tested.
> How much gets fed to the internal "egrep" at a time? When scanning
> the headers, is it called once for each header, or once for all the
> headers? Or once for some fixed size chunk? And the body, once per
> line, once per whole, or once per chunk? If the recipe checks body
> and headers, will the egrep ever get them together?

No 'chunking' is done.  Ignoring scoring, the regexp engine will be
called once given the entire area to search, be it just the head, just
the body, or both.  Chunking is so nasty that you generally only do it
when you know that the area being matched won't fit (conveniently) in
memory.  Since procmail operates on the assumption that it can slurp the
entire message, it can pass that guarantee on to the regexp engine.
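
For illustration, here is a minimal sketch of the three search areas
mentioned above, each of which is handed to the regexp engine in one
piece.  The patterns and folder names are hypothetical placeholders,
not anything from the original question:

        # Header only (the default search area)
        :0:
        * ^Subject:.*backup report
        reports

        # Body only
        :0 B:
        * ^Archive-Name:
        archives

        # Header and body together, as one area
        :0 HB:
        * ^X-Loop:
        loops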

When you're matching across the entire message, there's one blank line
between the header and body.  The body itself may start with blank lines
which would add to the visual gap, but as far as procmail and RFC 822
are concerned, the body starts after the first blank line.  Note that
procmail internally treats the blank line as being part of the header,
so if you want to match a message whose last header is "Keywords:" you
would use the condition:

        * ^Keywords:.*$$

The two dollar signs match the two newlines that end the header.  No
need to use double carets to anchor it, as consecutive newlines can
only appear at the end of the header.

If you want to match across the header/body separation you just need to
include the two newlines:

        :0 HB
        * ^Keywords:.*$$Archive-Name:

Well...almost.  That'll match a message which has Keywords: and Archive-Name:
entirely in the body separated by a blank line.  You need to force the
match to start in the header:

        :0 HB
        * ^^(.+$)*Keywords:.*$$Archive-Name:

The double caret anchors against the very beginning of the message; the
(.+$)* matches zero or more _non-empty_ lines (so it'll still be stuck in
the header), then comes what we're really looking for.
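
As a sketch, the same condition can be read piece by piece and dropped
into a complete recipe.  The folder name "archives" is hypothetical;
the comments sit above the recipe rather than on the condition line,
where they would be taken as part of the regexp:

        # ^^             the very beginning of the message
        # (.+$)*         zero or more non-empty lines (still in the header)
        # Keywords:.*$   the Keywords: field and the newline that ends it
        # $              the blank line that ends the header
        # Archive-Name:  the first thing in the body
        :0 HB:
        * ^^(.+$)*Keywords:.*$$Archive-Name:
        archives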

Personally, however, I would probably write the above using a different
technique:

        :0
        * H ?? ^Keywords:.*$$
        * B ?? ^^Archive-Name:

That seems to express what you want in a more direct fashion than the
single condition above.  I'll note that while the single condition is
faster for procmail to process, if breaking it into two conditions makes
it easier for you to later modify or just understand, then you should
split it (in six months, will you remember why the leading (.+$)* is
there?).  *Your* time is much more valuable than CPU time.
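
For completeness, a sketch of the two-condition form as a full recipe
with an action line.  The folder name "archives" is hypothetical:

        # Last header field is Keywords:, and the body begins with
        # Archive-Name:.  Both conditions must match (they are ANDed).
        :0:
        * H ?? ^Keywords:.*$$
        * B ?? ^^Archive-Name:
        archives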



...
> When writing recipes that check for stuff like that, is it the egrep
> that does it, or is the RE broken up internally to line fragments to
> check?
> If a chunk model is used, is the chunk LINEBUF sized? (If I get a
> piece of mail with a 20,000 byte subject, will I have problems
> from procmail?)

LINEBUF only applies to lines from procmailrcs.  You generally only
have to worry about LINEBUF when you have a variable expansion or
command expansion (backquotes) that doesn't have an obvious and
reasonable bound on its size.  procmail will avoid overrunning its
LINEBUF length buffer when doing command expansions by ignoring the
extra output, so you're safe there, as long as data truncation is
fine.  Variable expansion isn't checked like that, so you can cause
procmail to coredump by doing something like:

        :0
        * ^Subject: \/.*
        |some-program $MATCH

then feeding procmail a message with a huge Subject: header field.
Since no shell metacharacters appear in the action, the action line
will be expanded and exec()ed by procmail directly instead of by the
shell.  On the other hand, the following is fine:

        :0
        * ^Subject: \/.*
        |some-program $MATCH ;

The semicolon forces a shell invocation, and the shell *should* be
safe.  If your /bin/sh can buffer overrun on variable expansion, then
you're in more trouble than you know.

Action lines aren't the only place to watch your variable expansions.
Variable assignments and condition lines that have a leading dollar
sign also undergo expansion.  For example, this isn't safe:

        SUBJECT = `formail -x Subject:`
        NEWSUBJ = "Subject: $SUBJECT"

procmail won't buffer overrun in the first line, but a really long
subject could cause the second to do so.  The following should be
safe:

        NEWSUBJ = "Subject: `formail -x Subject:`"

but even then only if you're sure the shell is doing the expansion of
NEWSUBJ.

Note that matching against the value of a variable (using the "var ??"
condition special) is safe no matter what the size of the contents of
the variable.  The problem is when you _interpolate_ the variable
into something else.
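
A minimal sketch of that distinction, reusing the SUBJECT variable from
the example above.  The pattern and the folder name are hypothetical:

        SUBJECT = `formail -x Subject:`

        # The value of SUBJECT is matched directly with "var ??";
        # it is never interpolated into a condition or action line.
        :0:
        * SUBJECT ?? backup report
        reports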

Does that cover your questions?


Philip Guenther
