procmail

Re: weirdness with trailing ^^ and with ^.*$

1996-07-17 18:47:20
David W. Tamkin <dattier(_at_)wwa(_dot_)com> wrote:
Long story, but I was looking for a search expression that would always be
found twice, regardless of the message text.  (No, it wasn't simply to score

* 1^1 H ?? ^(From .*)?$

thinking for sure that the header would always contain one From_ line and
one empty line, but it counted three matches and scored 3!  Why??  Note that
there is a space after "From" to prevent matching the From: header.

Simplify it to:

        * 1^1 H ?? ^From .*$
        * 1^1 H ?? ^$

Which is equivalent to what you wrote above.  You'll now notice that procmail
thinks there are *two* empty lines in the header of a mail.
Why is that?  Well, it's actually done this way to *avoid* surprises most
of the time :-).

This is how procmail thinks of the text it is searching (the header
of the mail in this case):

Say you have a mail that looks like (without the leading spaces, of course):
 From foo
 Subject: bar
 
 This is the body
 of the mail

Procmail will store this internally as one long string:

        "From foo\nSubject: bar\n\nThis is the body\nof the mail\n\n"

The " are the start and end delimiters, the \n characters are the newlines.
For reasons I'll explain below, procmail will regard the " delimiters as
being equivalent to \n characters for the purpose of matching.

So, if we restrict ourselves to just the header of the mail, it looks like:

        "From foo\nSubject: bar\n\n"

If you now tell procmail to look for:   * ^$
It will start doing a match for the character sequence:         \n\n
(two consecutive newlines).
        "From foo\nSubject: bar\n\n"
The first match will be here   ####, then it searches on, and finds a second
        "From foo\nSubject: bar\n\n"
match,                    here   ###
Ergo, two empty lines.
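
The count above can be reproduced with a small model.  This is only a
sketch in Python, not procmail's code: it assumes that the start and end
delimiters behave exactly like newlines and that a search for ^$ amounts to
looking for two consecutive newlines, with matches allowed to overlap.  The
function name count_empty_lines is made up for this illustration.

```python
def count_empty_lines(text: str) -> int:
    """Count overlapping "\\n\\n" pairs in text, treating the start and
    end of the string as if they were newlines (toy model only)."""
    # Assumption: the two delimiters are modelled as real newlines.
    padded = "\n" + text + "\n"
    count = 0
    pos = padded.find("\n\n")
    while pos != -1:
        count += 1
        # Resume one character further on, so that the second newline of
        # this match can serve as the first newline of the next one.
        pos = padded.find("\n\n", pos + 1)
    return count


header = "From foo\nSubject: bar\n\n"
print(count_empty_lines(header))  # 2
```

In this model, a header that lacks the final empty line
("From foo\nSubject: bar\n") yields a count of 1.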

The reason why the start and end delimiters are regarded as being equivalent
to newlines is that:
- Otherwise ^From could never match the first From_ line.
- Official extended regexp docs dictate that ^ and $ do not match real
  characters, they match the empty virtual space just before the first
  and just after the last character on a line.

So, if you tell procmail to:    * SOMEVAR ?? ^check$
And SOMEVAR=check
Then it will match, despite the fact that there are no newlines in there.
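
The same idea as a sketch in Python (again a toy model, not procmail's
implementation; matches_whole_line is a hypothetical helper, and it only
handles a literal word rather than a real regexp).  It assumes ^ and $ can
each be satisfied either by a real newline or by a delimiter:

```python
def matches_whole_line(value: str, word: str) -> bool:
    """Model of "* VAR ?? ^word$": True if word occupies a whole line of
    value, with the string boundaries counting as line boundaries."""
    # Assumption: both delimiters are modelled as newline characters.
    padded = "\n" + value + "\n"
    return ("\n" + word + "\n") in padded


print(matches_whole_line("check", "check"))    # True
print(matches_whole_line("checked", "check"))  # False
```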

Anyhow, I finally found a working solution.  It seems that if you search the
head, procmail thinks it ends with *two* blank lines.  These three:

Indeed, elementary, my dear Watson :-).

So the simplest search that matches twice would be:

  * 1^1 H ?? ^$

The only time the header does *not* end with two newlines is when the
mail message that has been fed to procmail does not contain or end with an
empty line.

 These, though, find only one match and score 1:

* 1^1 H ?? ^^From |($)$^^

and surely enough, these don't match at all:

* ($)$^^

I think that "$^^" is yet another problem.

Excellent detective work.  Indeed.  Due to a (minor) parsing error,
procmail reads any $^ sequence as equivalent to ^^.
So, ($)$^^ becomes equivalent to (^)(^^)(^), which, obviously, can *never*
match.  This will be fixed in the upcoming version.

* 1^1 HOST ?? ^^.*$|^.*^^

It gets stranger.  If the variable used is unset or null, that last condition
finds an unlimited number of matches and reaches the maximum score (that's
one reason I'm using $HOST, which is never null or unset; the other reason is
that it never contains an embedded newline); why are ^^.*$ and ^.*^^ matching
an empty search area even once, let alone again and again?

The reason for the "once" should be clear by now.
The reason for the "again and again" is that it's a limitation/anomaly of
the search engine in procmail:

The string in which we search is:       ""

When you search for (^)^^ then you'll get a match where the first ^
will match the first ", and the trailing ^^ will match the trailing ".

Now, whenever the search does not end on a newline, procmail advances
one location, and starts the search anew.
If the search ended on a newline, procmail does *not* advance the location
and starts the search anew.
Say you have the string:                "abr\nac\nadabra"
And you look for ^a.*$
Now, the matches will walk as follows:  ######
                                            ######
                                                #########
As you see, the matches actually do overlap, but at the newlines.
So if the match ends in a newline, there is a special rule saying: start
at this spot for the next match.  The trouble is that this becomes a
bit tricky for empty texts and a search string such as (^)^^
                                    ""
The matches will walk as follows:   ##
                                    ## etc.
Why does it match at the same spot?
Well, procmail maintains a pointer to the current location.  The current
location would be the start of the string (the character after the leading ").
It then starts searching, but finds that the match started one token
earlier (the ^ matches the virtual beginning of the line).  The end of
the match is at the trailing ".  That qualifies as a virtual newline,
so the pointer isn't advanced, and the whole charade starts over again,
matching the very same spot.
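
The overlapping walk over "abr\nac\nadabra" can be imitated with a rough
Python sketch.  It is a simplified model under stated assumptions, not
procmail's search engine: the delimiters are modelled as newlines, ^ and $
are written as explicit \n in the pattern, and a match that ends on a
newline lets the next search start at that newline.  The name walk_matches
is invented for this example.  The sketch always moves forward, so it does
not reproduce the endless rematching on an empty search area described
above.

```python
import re


def walk_matches(text: str, pattern: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of successive matches in the padded text,
    letting consecutive matches share the newline between them."""
    # Assumption: start and end delimiters are modelled as newlines.
    padded = "\n" + text + "\n"
    regex = re.compile(pattern)
    spans = []
    pos = 0
    while pos <= len(padded):
        m = regex.search(padded, pos)
        if m is None:
            break
        spans.append(m.span())
        if m.end() > m.start() and padded[m.end() - 1] == "\n":
            # Match ended on a newline: restart at that newline.
            next_pos = m.end() - 1
        else:
            next_pos = m.end()
        # Always move past the start of this match so the loop terminates.
        pos = max(next_pos, m.start() + 1)
    return spans


# ^a.*$ written with explicit newlines in place of the anchors.
print(walk_matches("abr\nac\nadabra", r"\na[^\n]*\n"))
# [(0, 5), (4, 8), (7, 15)]
```

Each span begins on the newline (or delimiter) that ended the previous one,
which is the overlap shown in the ###### diagram.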

In other words, procmail cannot distinguish this regexp from an empty
match (if the search area is empty as well).
-- 
Sincerely,                                                          
srb(_at_)cuci(_dot_)nl
           Stephen R. van den Berg (AKA BuGless).

Up your accumulator.
