Re: Identify a .forward[ed] message

On 08 May 1999 21:40:32 -0700, Harry Putnam <reader(_at_)newsguy(_dot_)com>
wrote:

"David W. Tamkin" <dattier(_at_)Mcs(_dot_)Net> writes:

:0:
* ^Received:(.+$)+Received:(.+$)+Received:.*from \
([^ ]+\.)?worldnet\.att\.net.*by newsguy\.com.*for <reader@newsguy.com>;
* ! ^TO_reader(x@worldnet\.att\.net|@newsguy\.com)

<...>

I wonder if you could give an account in English of what the RE above
is doing.  (for the RE impaired)


All of this is being matched against the headers of the message (in
the absence of a B flag).

^         Beginning of line
Received: (Literally)
(         A group consisting of
  .         any character except newline,
  +         one or more times,
  $         then a newline
)+        This group can be repeated as often as necessary
          for there to be a match. In slightly higher-level terms,
          this means "a Received: line, followed by zero or more
          arbitrary header lines"; in still other words, the first
          repetition has to follow Received: but any additional
          occurrence of "a run of non-newlines, then a newline"
          matches any header line (including Received: lines if
          necessary)
Received: Another Received header
(.+$)+    followed by any number of arbitrary header lines, as above
Received: Look, here's another one
.*        where we permit any number of non-newline characters before
from      this literal string (there's a space after the m)
(         then a group consisting of
[^ ]+       one or more non-space (and non-newline) characters
\.          followed by a literal dot
)?        or not (i.e. this group is optional; it allows a host name
          within worldnet.att.net)
worldnet  (literally)
\.        literal dot
att       ...

What follows is a group of strings, each bracketed by the .* regex,
which stands for any number of non-newline characters. The intent of
these "skips" is probably mostly to allow for a bit of variation on
systems where the Received: line format is not strictly what David
expected. (Received: lines on non-Sendmail systems will look a lot
different from this, though. Newsguy is presumably using some sort of
Sendmail now, although some other MTAs use basically the same
Received: line format and I'm too lazy to go back and check. If they
replace it with e.g. qmail, your recipe will have to be rewritten.)
Presumably the "hard" part is knowing what you can expect to remain
unchanged and thus match on with some certainty, not the regular
expression in and of itself.

Let's step back a couple of notches and look at the whole Received:
line again, keeping in mind that we allow some arbitrary variations in
some parts, and just picking out the parts we hope will remain
constant in the messages we will be matching; and also substituting
some parts of the regex with "pseudocode" expressions (literal dots
etc).

Received: (at beginning of line)   This is a Received: header
from ({anyhost}.)?worldnet.att.net where the message came from WorldNet
by newsguy.com                     and it was accepted by newsguy
for <reader@newsguy.com>;          and it was "for" this address

(There should be a \. between "newsguy" and "com" in this second
instance too.)
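
To see the first condition at work outside procmail, here is a rough
Python sketch. It is only an approximation under stated assumptions:
procmail's "$" (which consumes a newline) is written as \n, matching is
case-insensitive as it is in procmail without the D flag, folded header
lines are assumed to be joined already, and the dot in "newsguy\.com"
is escaped as noted above. The sample header is invented; its host
names and ids are placeholders, not real Newsguy or WorldNet data.

```python
import re

# Approximate Python rendering of the first condition (an assumption,
# not procmail itself): "$" becomes \n, groups are non-capturing.
FIRST_CONDITION = re.compile(
    r"^Received:(?:.+\n)+Received:(?:.+\n)+Received:.*from "
    r"(?:[^ \n]+\.)?worldnet\.att\.net.*by newsguy\.com.*for "
    r"<reader@newsguy\.com>;",
    re.MULTILINE | re.IGNORECASE,
)

# Invented sample header; host names and ids are hypothetical.
SAMPLE_HEADER = (
    "Received: (from mail@localhost) by newsguy.com id AAA111; Sat, 8 May 1999\n"
    "Received: from localhost by newsguy.com id AAA222; Sat, 8 May 1999\n"
    "Received: from mx1.worldnet.att.net by newsguy.com id AAA333"
    " for <reader@newsguy.com>; Sat, 8 May 1999\n"
    "From: someone@example.com\n"
    "To: somelist@example.org\n"
)


def first_condition_matches(header: str) -> bool:
    """Return True if the header block satisfies the approximated regex."""
    return FIRST_CONDITION.search(header) is not None


# Three Received: lines, the last one showing the WorldNet hop.
print(first_condition_matches(SAMPLE_HEADER))
# Drop the first two Received: lines, so only the WorldNet hop is left.
print(first_condition_matches(SAMPLE_HEADER.split("\n", 2)[2]))
```

The second call fails to match because the expression insists on two
earlier Received: lines before the one that shows the WorldNet hop.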

The second regular expression is very simple (if you understand what
^TO_ stands for) and it says that there mustn't be a match (!) on a
header like To:/Cc:/Resent-To:/etc (^TO_) where the address part of
that header is either readerx@worldnet.att.net or reader@newsguy.com
(the "reader" part has been factored out of the parens for
efficiency). In other words, if the message was not explicitly
addressed to one of these addresses, it was Bcc:ed (or something to
that effect).
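
The negated ^TO_ test can be sketched in Python as well. The expansion
of ^TO_ used here is written from my recollection of procmailrc(5), so
treat it as an assumption and check your own man page; the function
name and the sample addresses in the usage lines are made up for
illustration.

```python
import re

# ^TO_ as I recall it from procmailrc(5) (an assumption): a recipient
# header name, a colon, then optionally anything that ends in a
# character which cannot be part of an address.
TO_ = (
    r"^(?:(?:Original-)?(?:Resent-)?(?:To|Cc|Bcc)"
    r"|(?:X-Envelope|Apparently(?:-Resent)?)-To):"
    r"(?:.*[^-a-zA-Z0-9_.])?"
)

# The part of the condition after the "!".
EXPLICITLY_ADDRESSED = re.compile(
    TO_ + r"reader(?:x@worldnet\.att\.net|@newsguy\.com)",
    re.MULTILINE | re.IGNORECASE,
)


def second_condition_holds(header: str) -> bool:
    """True when no recipient header names either address (the '!' case)."""
    return EXPLICITLY_ADDRESSED.search(header) is None


print(second_condition_holds("To: somelist@example.org\n"))
print(second_condition_holds("Cc: Harry <readerx@worldnet.att.net>\n"))
```

The first call is the Bcc-like case the recipe is after; the second is
explicitly addressed mail, which the recipe leaves alone.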

(In principle, you can't rely on the "for youraddress" part if a
message could be Bcc:ed to many people at Newsguy who all have their
mail forwarded there from WorldNet, but this seems somewhat
theoretical.)

One should normally be rather paranoid about the .* regular expression
but in this case, you know you are matching parts of Received: lines
that follow a certain pattern, and there is little danger that some
parts of that pattern could overlap by mistake, as long as each part
of what you're looking for is anchored by a keyword such as "from",
"by", "for". Possibly I'd like to anchor those a little bit more by
adding a \< before each (i.e. "\<from " instead of "from "). This
might entail some additional changes to the overall expression,
though; you could instead change each occurrence of ".*" into the
expression "(.*\<)?" which reads roughly, "any string, as long as it
ends in a non-word character, or nothing at all (because the whole
expression is made optional by the trailing question mark)".
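
The effect of the extra anchoring can be illustrated with a small
Python sketch. Python has no direct equivalent of procmail's \<, so \W
(one non-word character) stands in for it here; that is an
approximation, and the "xfrom" header line is deliberately contrived.

```python
import re

FLAGS = re.MULTILINE | re.IGNORECASE

# Unanchored skip, as in the recipe: ".*" before the keyword.
LOOSE = re.compile(
    r"^Received:.*from (?:[^ \n]+\.)?worldnet\.att\.net", FLAGS
)

# Skip that must end in a non-word character, or be absent altogether;
# \W is an approximation of procmail's \<.
ANCHORED = re.compile(
    r"^Received:(?:.*\W)?from (?:[^ \n]+\.)?worldnet\.att\.net", FLAGS
)

normal = "Received: from mx1.worldnet.att.net by newsguy.com\n"
odd = "Received: xfrom mx1.worldnet.att.net by newsguy.com\n"  # contrived

for line in (normal, odd):
    print(bool(LOOSE.search(line)), bool(ANCHORED.search(line)))
```

Both patterns accept the normal line; only the loose one accepts the
contrived line, where "from" is the tail end of a longer word.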

/* era */

-- 
.obBotBait: It shouldn't even matter whether     <http://www.iki.fi/era/>
I am a resident of the state of Washington. <http://members.xoom.com/procmail/>
 * Sign the European spam petition! <http://www.politik-digital.de/spam/en/> *