regexp detail

1999-05-09 12:06:36
Harry Putnam saw this in my post,

| >  * ^Received:(.+$)+Received:(.+$)+Received:.*from \
| >    ([^ ]+\.)?worldnet\.att\.net.*by newsguy\.com.*for <reader@newsguy.com>;

[Oops, there should have been another backslash before the last period, as in
 <reader@newsguy\.com>; it probably won't hurt to have left it out, but it's
 better to have it in there.]

and this,

| >  * ! ^TO_reader(x@worldnet\.att\.net|@newsguy\.com)

and asked me,

| David, I haven't been able to experiment with this yet, just studying
| your approach though, I'm finding my limited experience with regexp
| isn't making it easy to see what you are doing here.
| 
| I wonder if you could give an account in english of what the RE above
| is doing.  (for the RE impaired)

One thing is that it contains a major procmailism: the use of a non-terminal
dollar sign (or a non-initial caret) to represent an embedded newline.  So
here we go,

 ^       newline at the end of the previous line or (putative newline at)
         the beginning of the search area
 Received:
 (.+$)   at least one non-newline character plus a newline, i.e., a non-empty
         line or enough of the end of a non-empty line to include at least
         one character before the newline and the newline
 (.+$)+  one or more of those ... the rest of the Received: line after
         "Received:" and perhaps more lines from the head, but empty lines
         are not allowed in it [because we don't want to go as far as the
         empty line at the neck]
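
For anyone who wants to poke at this outside procmail, here is a rough Python sketch of the same idea.  It is only an approximation: Python's re module is not procmail's matcher, so the embedded $ is spelled as a literal \n, and re.MULTILINE plus re.IGNORECASE stand in for procmail's line-anchored ^ and its case-insensitive matching.  The sample heads and the name THREE_RECEIVED are invented for illustration, and each header is kept on a single line to keep things simple.

```python
import re

# Approximation of  ^Received:(.+$)+Received:(.+$)+Received:
# procmail's embedded "$" (a newline) is written as "\n" here.
THREE_RECEIVED = re.compile(
    r"^Received:(.+\n)+Received:(.+\n)+Received:",
    re.MULTILINE | re.IGNORECASE,
)

# Made-up message heads (assumption: one physical line per header).
three_hops = (
    "Received: from a.example by b.example\n"
    "Received: from c.example by a.example\n"
    "Subject: other header lines in between are fine\n"
    "Received: from d.example by c.example\n"
    "\n"
    "body\n"
)
two_hops = (
    "Received: from a.example by b.example\n"
    "Received: from c.example by a.example\n"
    "\n"
    "Received: this line is in the body, past the empty line\n"
)

print(bool(THREE_RECEIVED.search(three_hops)))  # True
print(bool(THREE_RECEIVED.search(two_hops)))    # False
```

The second head fails because (.+\n) needs at least one character before each newline, so the match cannot step over the empty line at the neck to reach the "Received:" in the body.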

In other words, the match runs from one Received: line, farther along in the
message head to a second Received: line, and then on to (at least) a third
Received: line, in which we have to find this:

.*from ([^ ]+\.)?worldnet\.att\.net.*by newsguy\.com.*for <reader@newsguy\.com>;

 .*    any string of zero or more non-newline characters [i.e., stay in the
       same line]
 [^ ]  any character except a space or a newline
 [^ ]+\.  one or more characters that are not spaces or newlines, plus a 
          literal period
 ([^ ]+\.)?  zero or one of that sequence [i.e., either that sequence or
             no text at all]

The reason for ([^ ]+\.)? is that maybe there will, maybe there won't be a
machine name before "worldnet.att.net" and maybe the machine name will be
more than one level deeper in subdomains.  So perhaps the Received: line
will say "from worldnet.att.net" or "from somemachine.worldnet.att.net" or
"from somemachine.somecluster.worldnet.att.net" and we want to be prepared
for any of them.
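
The three cases just listed can be checked with a small Python sketch.  Again this is an approximation under stated assumptions, not procmail itself: the class is written [^ \n] so that it excludes the newline as described above, re.IGNORECASE mimics procmail's default, and the helper name from_worldnet and the sample lines are hypothetical.

```python
import re

# Optional "machine." (or "machine.cluster.") prefix before the domain.
# [^ \n] mirrors the description above: neither a space nor a newline.
FROM_WORLDNET = re.compile(
    r"from ([^ \n]+\.)?worldnet\.att\.net", re.IGNORECASE
)

def from_worldnet(line: str) -> bool:
    """Return True if the line names worldnet.att.net, with or without a host prefix."""
    return FROM_WORLDNET.search(line) is not None

for host in (
    "worldnet.att.net",
    "somemachine.worldnet.att.net",
    "somemachine.somecluster.worldnet.att.net",
):
    assert from_worldnet("Received: from " + host + " by newsguy.com")

# A different domain is not matched.
assert not from_worldnet("Received: from mail.example.org by newsguy.com")
```

Because [^ ]+ is greedy and may itself contain periods, one optional group is enough to swallow any number of subdomain levels.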

The rest of the line should be no problem.

As to the other condition, ^TO_ is a procmail special token that matches most
ways of listing a recipient in the visible headers of a message.  (I think
Harry knew that already, though.)
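
To give a feel for what the negated ^TO_ condition does, here is one more Python sketch.  The TO_ pattern below is a deliberately simplified, hypothetical stand-in: the real expansion is documented in procmailrc(5) and covers more header names than this.  The function name not_addressed_to_me and the sample headers are also invented.

```python
import re

# Simplified stand-in for procmail's ^TO_ token (NOT the real expansion):
# a recipient header name, then optionally anything on the same line that
# ends in a character which cannot be part of an address word.
TO_ = r"^(To|Cc|Bcc|Resent-To|Apparently-To):(.*[^-a-zA-Z0-9_.\n])?"

ADDRESSED_TO_ME = re.compile(
    TO_ + r"reader(x@worldnet\.att\.net|@newsguy\.com)",
    re.MULTILINE | re.IGNORECASE,
)

def not_addressed_to_me(head: str) -> bool:
    """Mimic the negated condition: True when no recipient header names either address."""
    return ADDRESSED_TO_ME.search(head) is None

print(not_addressed_to_me("To: reader@newsguy.com\nSubject: hi\n"))            # False
print(not_addressed_to_me("Cc: Someone <readerx@worldnet.att.net>\n"))         # False
print(not_addressed_to_me("To: friends@example.org\nSubject: hi\n"))           # True
print(not_addressed_to_me("Bcc: otherreader@newsguy.com\n"))                   # True
```

The last case shows why the token insists on a non-word character before the address: "otherreader" must not be mistaken for "reader".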