procmail

Re: What's the difference between "^", "$" and "^^" in regexp?

1997-03-27 13:47:27
On Thu, 27 Mar 1997 10:23:57 -0800,
"Simeon ben Nevel" <Simeon(_dot_)Nevel(_at_)Schwab(_dot_)COM> wrote:
> Unhappily, I haven't been able to find any explanation of what the
> "search area" is and how it differs from the area that the normal
> "^" and "$" tokens apply to.

The search area is all of the text that Procmail looks at when trying
to find a match. By default, that is the entire header of the message.
Thus "^^From " can be expected to match on a normal Berkeley mbox
format file, where the header begins with the "From " line, and
"^^Subject" cannot. Likewise you could use e.g.
"^Subject:[^\n]*$$" to check if the Subject: header is the last of the
header lines (and spans only a single line). 
  The B flag changes the search area to all of the body of the
message. 
  (In an ideal world you could then find people's .signatures using
    :0B 
    * ^-- *(\n[^\n]*){0,4}\n?$$
if the Procmail of this ideal world would use this slightly expanded
regexp syntax and people in this ideal world could agree to
standardize on the .signature format described by this regular
expression. [Hmm, bad example; the thing it's supposed to demonstrate
is that the regexp is anchored at the end of the whole message body
using `$$'.] ;-)
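
As a rough sketch of the two search areas, here are two independent
recipes. The mailbox names from-line and body-hello are made up for
illustration, and the conditions are only meant to show where the
anchor applies, not to be useful filters.

```procmail
# The header is the default search area; ^^ anchors at its very first
# character, so this tests for the mbox "From " line on top.
:0:
* ^^From[ ]
from-line

# With the B flag the search area is the body instead, so ^^ now
# anchors at the first character of the body.
:0B:
* ^^Hello
body-hello
```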

> Also, why/how would one use the "\<" or "\>" tokens?  Again, the man
> pages are less than explicit about the whole issue.

To take a recent example, From:.*foo@bar will match something like
From: snahfoo@barbaric.com when the intention was probably to only
match on the specific address foo@bar.com. You can achieve a bit more
security by saying \<foo@bar\>; now it will only match if foo@bar is
enclosed in non-alphabetic characters.
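
In recipe form, that might look something like the following. The
mailbox name from-foo is hypothetical, and the address is spelled out
in full with an escaped dot so that the dot only matches a literal
period.

```procmail
# \< and \> each have to match one character that is not part of a
# word, so snahfoo@bar.com and foo@bar.community are both rejected.
:0:
* ^From:.*\<foo@bar\.com\>
from-foo
```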

Unlike egrep regular expressions, Procmail's \< is less general in
that it actually has to match a character, where egrep matches on the
"imaginary boundary" between anything [including nothing ;^] and an
alphabetic, or "word", character. 
  Probably an example would be in order here as explanation:
From:.*\<foo@bar\> will +not+ match From:foo@bar, whereas in egrep, it
would.
  If you want to cover the case where there is nothing after the
colon, you have to write something like From:(.*\<)?foo@bar\> -- this
says that whatever precedes foo@bar on the line can be either the
empty string [the question mark after the paren covers that] or any
string, provided that the character immediately adjacent to foo@bar
is not alphabetic.
  The reason why you don't need the same trick at the end of the
regexp is that \> can match a newline, i.e. what we'd normally
consider "the empty string".
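
Put together as a recipe, a minimal sketch could read as follows. The
mailbox name from-foo and the sample header lines in the comments are
invented for illustration, and the comments simply apply the behavior
of \< and \> as described above.

```procmail
# Would match:      From:foo@bar.com          (nothing before the address)
# Would match:      From: Foo <foo@bar.com>   (\< takes "<", \> takes ">")
# Would not match:  From: snahfoo@bar.com     ("h" is alphabetic)
:0:
* ^From:(.*\<)?foo@bar\.com\>
from-foo
```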
  (And the reason you want to allow for something random before the
actual e-mail address is that people can put most anything in their
From: field but the e-mail address has to be in there somewhere in
unadulterated form. The message you're reading right now is From: era
eriksson <reriksso(_at_)cc(_dot_)helsinki(_dot_)fi> but I could change the "era
eriksson" part very easily. [Changing the e-mail address part isn't
all that hard but changing that is essentially forgery whereas nobody
would get upset if I changed the name part.])

Hope this helps,

/* era */

-- 
Defin-i-t-e-ly. Sep-a-r-a-te. Gram-m-a-r.  <http://www.iki.fi/~era/>
 * Enjoy receiving spam? Register at <http://www.iki.fi/~era/spam.html>