Re: Deconstruction of FROM_DAEMON

On Fri, 7 Nov 1997 13:15:23 -0500, "Nevel, Simeon"
<Simeon(_dot_)Nevel(_at_)Schwab(_dot_)COM> wrote about ^FROM_DAEMON:

I'm having trouble understanding the RE *after* the "From "
but before the "secondary" header field (#1) (Postmaster, 
daemon etc) and the one *after* the secondary header field (#2).
#1
([^>]*[^(.%@a-z0-9])?
I translate this to mean an Optional group of characters
consisting of zero or more characters that aren't a ">"
followed by a single character that isn't any of "(.%@a-z0-9"


In the older copy of the manual I have here, this is just (.*[^(...])?
which would match any run of characters, followed by any character
which is not valid in an e-mail address (but not delve into
parenthesized comments). I believe the intent here is to skip any
unparenthesized comments, as in " System Postmaster Account " in the
line "From: System Postmaster Account <postmaster@site.net>".
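
For anyone who wants to experiment with this outside procmail, here is a
minimal Python sketch of the optional prefix. It assumes Python's re
module is close enough to procmail's syntax for this fragment, uses
IGNORECASE to imitate procmail's default case-insensitive matching, and
lets a single keyword ("postmaster") stand in for the full list. The
names PREFIX, pattern and keyword_found are hypothetical.

```python
import re

# Optional "fluff" prefix from the thread: any run of non-">" characters,
# ending in one character that cannot be part of an address token.
PREFIX = r"([^>]*[^(.%@a-z0-9])?"

# Reduced pattern for illustration only: "postmaster" stands in for the
# whole keyword alternation of the real FROM_DAEMON expression.
pattern = re.compile(r"^From:" + PREFIX + r"(postmaster)", re.IGNORECASE)


def keyword_found(header: str) -> bool:
    """Return True if the keyword is reachable through the optional prefix."""
    return pattern.search(header) is not None


if __name__ == "__main__":
    for line in (
        "From: System Postmaster Account <postmaster@site.net>",
        "From: notpostmaster@site.net",
        "From: (postmaster fan) bob@site.net",
    ):
        print(keyword_found(line), line)
```

The second line fails because the character before the keyword is a
letter, and the third because it is an opening parenthesis; both are
excluded by the trailing character class of the prefix.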

#2
(([^).!:a-z0-9][-_a-z0-9]*)?[%@>\t ][^<)]*(\(.*\).*)?)?$([^>]|$)
                                ^^^
That "\t " sequence *really* has me confused?  Is the "\t" 
sequence supposed to represent a tab?  I thought the procmail


I think this is only to make it obvious to the reader that that is a
tab. The source (config.h) actually has a literal tab here.
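
As a rough illustration of the class itself, here is a short Python
sketch. It relies on Python's re module, which expands "\t" in a pattern
to a tab on its own; nothing here says procmail's matcher does the same.
The names SEPARATOR and is_separator are hypothetical.

```python
import re

# The separator class from the expression: percent, at-sign, closing
# angle bracket, tab, or space. Python's re turns "\t" into a real tab.
SEPARATOR = re.compile(r"[%@>\t ]")


def is_separator(ch: str) -> bool:
    """Return True if ch is exactly one character from the class."""
    return SEPARATOR.fullmatch(ch) is not None
```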

PS.  Am I correct in assuming that in a "character class"
(things between "[" and "]") that RE metacharacters *don't*
need to be escaped?


Yes. [$[^\] would match any one of the literal characters dollar,
opening bracket, caret, and backslash.
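
The same idea can be tried in Python, with one caveat: Python's re
module still treats a backslash as an escape inside brackets, so the
class has to be spelled slightly differently there than in the procmail
form above. This is only a sketch of the Python equivalent; the names
SPECIAL and is_special are hypothetical.

```python
import re

# Python spelling of a class holding dollar, opening bracket, caret and
# backslash. Unlike the procmail form, the backslash must be doubled,
# and the opening bracket is escaped to keep the pattern unambiguous.
SPECIAL = re.compile(r"[$\[^\\]")


def is_special(ch: str) -> bool:
    """Return True if ch is one of the four literal characters."""
    return SPECIAL.fullmatch(ch) is not None
```

Neither the dollar nor the caret (when it is not the first character)
needs escaping inside the brackets.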

And here is the entire RE (for context), with my comments:
(^
 (Precedence:.*(junk|bulk|list)
  |To: Multiple recipients of 
  |(((Resent-)?(From|Sender)|X-Envelope-From):
  |>?From )


The indentation here is misleading; I'd write this

    |To: Multiple recipients of
    |(((Resent-)?(From|Sender)|X-Envelope-From):
      |>?From)

   ([^>]*[^(.%@a-z0-9])?


Skip fluff, maybe. ("Maybe" as in "optionally". :-)

    (Post(ma?(st(e?r)?|n)|office)         Postoffice, Postman,


(and various other similar strings elided)

     (([^).!:a-z0-9][-_a-z0-9]*)?[%@>\t ][^<)]*(\(.*\).*)?)?$([^>]|$))
)


You'll notice the trailing expression here starts with a somewhat
similar character class to the one near the beginning. Also note that
several of these expressions are optional, i.e. governed by a ? after
the closing paren. 

    (([^).!:a-z0-9]   End of e-mail address token
      [-_a-z0-9]*     Another alpha token
      )?              ... or maybe not;
     [%@>\t ]         Address separator -- either <address@...> or
                        <address> or a bare address with whitespace
                        around it
     [^<)]*           Skip as long as we don't run into another
                        broketed address or end of comment
                        (presumably to prevent this from matching
                        inside parenthesized comments in the first
                        place)
     (\(.*\).*)?      Skip optional parenthesized comments and
                        anything after them if found
    )?                ... or maybe not; maybe we just see an ...
   $                  ... end of line instead
   ([^>]|$)           Uh, I should know what this is supposed to do,
                        but I can't quite remember what it's for. I
                        think it had something to do with continued
                        header lines ... Anyone?
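
To see how the pieces above interact, here is a minimal Python sketch of
the trailing expression. Two things are assumptions: the keyword list is
reduced to "postmaster", and the final "$([^>]|$)" is replaced by
Python's plain end-of-string "$", which is not claimed to be equivalent
to what procmail does there. The names TAIL, pattern and tail_matches
are hypothetical.

```python
import re

# Trailing part of the expression, without the final "([^>]|$)".
TAIL = (
    r"(([^).!:a-z0-9][-_a-z0-9]*)?"  # optional extra token, e.g. "-request"
    r"[%@>\t ]"                      # address separator
    r"[^<)]*"                        # skip up to "<" or ")"
    r"(\(.*\).*)?"                   # optional parenthesized comment
    r")?$"                           # ... or nothing at all, then end
)

# "postmaster" stands in for the full keyword alternation.
pattern = re.compile(r"postmaster" + TAIL, re.IGNORECASE)


def tail_matches(text: str) -> bool:
    """Return True if the keyword is followed by an acceptable tail."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for text in (
        "postmaster@site.net",
        "postmaster-request@site.net",
        "<postmaster@site.net> (Mail System)",
        "postmasters@site.net",
    ):
        print(tail_matches(text), text)
```

The last sample fails because the keyword is followed directly by a
letter, which neither the optional token nor the separator class
accepts.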

Actually, it would be very nice if these expressions were in fact
documented somewhere ...

/* era */

-- 
 Paparazzi of the Net: No matter what you do to protect your privacy,
  they'll hunt you down and spam you. <http://www.iki.fi/~era/spam/>
