procmail

Re: What's the difference between "^", "$" and "^^" in regexp?

1997-03-27 13:50:38
"Simeon ben Nevel" <Simeon(_dot_)Nevel(_at_)Schwab(_dot_)COM> writes:
> In the procmailrc man pages, "^", "$" and "^^" in regexps are defined as
> follows:
>
> ^    - Start of a line
> $    - End of a line
> ^^   - Anchor the expression at the very start of the search area, or if
>       encountered at the end of the expression, anchor it at the very end
>       of the search area.
>
> In addition, "^" and "$" will match a new-line.
>
> Unhappily, I haven't been able to find any explanation of what the
> "search area" is and how it differs from the area that the normal
> "^" and "$" tokens apply to.
>
> I'd appreciate it if somebody could/would provide some enlightenment?


Okay, an explanation of the "search area" bit.  By default, the search
area is the header of the message, _including_the_trailing_blank_line_.
This can be changed using the 'B' and 'H' flags to be just the body
(the 'B' flag) or both combined (both the 'H' and 'B' flags).  Beyond
that, you can also change the search area for a single condition using
the special "var ??" condition prefix.  If, while scanning the beginning
of a condition line (before the regexp itself), procmail sees the name
of a variable followed by two question marks, it will make the search
area the value of that variable, with the proviso that if the
'variable' is "H", "B", or either of "HB" and "BH", procmail will
instead use the header of the message, the body of the message, or the
entire message, respectively, for the search area.  E.g.:

# Make searching the body the default
:0 B
# this will search the body
* foo|bar
# this will search the header
* H ?? foo|bar
# this will search the entire message
* HB ?? foo|bar
# ...as will this
* BH ?? foo|bar
# this will search the value of the HOME variable
* HOME ?? /(Net|home)/
{ }
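
The "var ??" form is most useful with values you compute yourself.  A
minimal sketch, assuming formail is on the PATH; the variable name
SUBJECT and the folder name "urgent" are made up for illustration:

```procmail
# Capture the body of the Subject: field into a variable.  procmail
# feeds the message to the command inside the backquotes on stdin.
# (SUBJECT is a hypothetical variable name.)
SUBJECT=`formail -zxSubject:`

# The search area for this condition is the value of SUBJECT,
# not the header of the message.
:0
* SUBJECT ?? urgent
urgent
```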


Okay, now for the explanation of ^, $, and ^^.  '^^' will only match
the very beginning or the very end of the search area.  Thus, the
following condition:

* B ?? ^^foo^^

would only match if the body of the message *only* contained "foo", and
I mean *ONLY* "foo".  If it had a newline after it (as any real body
would), then there would be no match.  Similarly, the following
condition:

* H ?? ^^From:

would only match if the From: header was the very first header.
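
To make the '^^' behaviour concrete, here is a small sketch contrasting
it with '^' on the body.  The folder names are hypothetical, and the
uuencode "begin" line is just a convenient thing to look for:

```procmail
# Fires only if the body *starts* with a uuencode "begin" line,
# i.e. nothing at all precedes it in the body.
:0
* B ?? ^^begin [0-7]
uuencoded-at-top

# Fires if *any* line of the body starts that way.  Since the recipe
# above delivers, this one only sees messages where the "begin" line
# is further down.
:0
* B ?? ^begin [0-7]
uuencoded-somewhere
```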


On the other hand, '^' and '$' not only match the very beginning and
end of the search area (respectively), but also match at any embedded
newline, so that the condition:

* H ?? ^From:

would match if the header contains a line beginning with "From:", and
the condition:

* H ?? ^Subject: foo$

would only match if the subject of the message is just "foo".
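
As a rough sketch of what the trailing '$' buys you (folder names are
made up, and remember that procmail matches case-insensitively unless
you give the 'D' flag):

```procmail
# Subject is exactly "foo": the '$' has to match the newline that
# ends the Subject: line.
:0
* ^Subject: foo$
exact-foo

# Without the '$', the subject only has to *start* with "foo", so
# "foobar" and "foo and more" land here.
:0
* ^Subject: foo
starts-with-foo
```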


When do you use which?  I find that when matching against the header or
body of a message, I use '^' and '$', because I'm just trying to match
one line out of many.  The only time I really use '^^' is when matching
against a variable, where I want to match the actual beginning of the
value and not an embedded newline.
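
Here is a sketch of that variable case.  The variable LIST and the log
messages are invented for the example; NL is the usual idiom for
getting a newline into a variable:

```procmail
NL="
"
# LIST is a hypothetical variable whose value is two lines:
# "alpha" and "beta".
LIST="alpha${NL}beta"

# '^' also matches at the embedded newline, so this condition succeeds
# and the line is written to the logfile.
:0
* LIST ?? ^beta
{ LOG="caret matched after the embedded newline$NL" }

# '^^' anchors only at the real start of the value ("alpha..."),
# so this condition fails and nothing is logged.
:0
* LIST ?? ^^beta
{ LOG="double caret matched at the start of the value$NL" }
```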


> Also, why/how would one use the "\<" or "\>" tokens?  Again, the man
> pages are less than explicit about the whole issue.


Huh?  What's inexplicit about the following:

     \< or \>  Match the character before or after a word.   They
               are  merely  a  shorthand for `[^a-zA-Z0-9_]', but
               can also match newlines.  Since they match  actual
               characters,  they  are  only  suitable  to delimit
               words, not to delimit inter-word space.


For an example use, how about matching the end of the user part of
an email address?

* ^TO_guenther\>

That would match

        To: guenther@gac.edu
or
        To: guenther
but not
        To: guentherp@gac.edu
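
The same idea works for picking a whole word out of a line.  A sketch
with a made-up folder name; note that because "\<" has to match an
actual character, some non-word character (here, typically the space
after the colon) must sit directly in front of the word:

```procmail
# Matches "Subject: cron report" and "Subject: nightly cron"
# (the \> can match the newline that ends the line), but not
# "Subject: cronjob" or "Subject: microns".
:0
* ^Subject:.*\<cron\>
cron
```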

Understand?


Philip Guenther