Re: Extracting Email Address In From: Field

On Sun, 25 May 1997 23:20:09 -0700 (PDT),
Dave/WebMaster <ddave(_at_)ddave(_dot_)com> wrote:

On Mon, 26 May 1997, era eriksson wrote:

On Sun, 25 May 1997 20:06:59 -0700 (PDT),
Dave/WebMaster <ddave(_at_)ddave(_dot_)com> wrote:

I have run into a brick wall (in my mind) trying to simplify this procmail
recipe. The line below, where formail is called, needs to strip everything
in the From: field except the actual email address in <>'s. I can't seem
to come up with a way to do this with formail and figure sed would have to
be called. Sed is a bit beyond my comprehension. Any ideas? 
  FROM=`formail -ztxFrom:`

You see this used a lot:
  FROM=`formail -rtzxTo:`

I use this in my working listserv recipe because it does look for the
real reply-to address, minus the colon, so procmail extracts from the To
pseudo header field. There is a special case, read further into my original

Huh? There is no "pseudo header field" without a colon. If you leave
off the colon, you still get the normal To: field, plus potentially
Tomato:, Toast:, and Toenails:.

post, where I need to match the address in the From: field. I'm just
trying to save a bit of fingerwork on my part. If I bounce a message to
the list, I don't get a cc but the original poster does, which I am
trying to avoid. I want to strip the name and <> brackets so I can keep a
file of the bare email addresses.

If you really want to get rid of the brackets, sed -e 's/[<>]//g'
should do that (but it is probably not worth it -- I'd keep the brokets
in the file instead).
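
For a quick check of what that expression does, here is a small sketch.
The address is made up and strip_brokets is just a name I picked for
the illustration, not anything formail or procmail provides:

```shell
# Hypothetical helper: delete every < and > from each input line.
strip_brokets() {
    sed -e 's/[<>]//g'
}

printf '%s\n' '<walrus@example.com>' | strip_brokets
# prints: walrus@example.com
```

Note that this only removes the two characters themselves; any real
name in front of the address is left alone.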

If you really insist that you want the contents of the From: field
under all circumstances, you run into a bit of trouble because parsing
that header is not trivial in the general case. But if you're content
with an approximate method, try running the following on the output
of formail -zxFrom: 
  sed -e 's/ *([^)]*) *//g' -e 's/.*<\([^>]*\)>.*/\1/g'

The sed substitution command is probably worth getting acquainted with
if you spend time thinking about these things. Let's take the first
one apart:

 -e       What follows is a line of sed script:
  s         Substitute (unconditionally):
   /          Here's the start of a regular expression:
     *          Any number of spaces, followed by
    (           an open paren, followed by
    [^)]*       any number of characters which are not closing parens
    )           followed by a closing paren, and
     *          again any number of spaces
   /         End of the first regexp; here's what to substitute with:
               (nothing)
   /         End of the substitution expression
   g         Do this globally (otherwise you only substitute the first
              occurrence on a line, which is probably not a big deal in
              this case)
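
To try the first expression on its own, something like this should do.
It is only a sketch: the sample header text is invented, and
strip_comments is a hypothetical name for the illustration:

```shell
# Hypothetical helper: remove parenthesized comments together with
# any plain spaces on either side of them.
strip_comments() {
    sed -e 's/ *([^)]*) *//g'
}

printf '%s\n' 'walrus@example.com (I am the Walrus)' | strip_comments
# prints: walrus@example.com
```

In this old-style (basic) regexp syntax the unescaped parens are
literal characters, which is why they match the parens in the header.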

The second one is very similar but works on brokets instead of
parens. The escaped parens \( \) in this regular expression mark a
group to remember, rather like the \/ operator in Procmail, which
marks where the remembered match starts. The \1 is a "back reference",
the number meaning the first remembered group (the stuff inside the
brokets):

 -e       What follows is another line of sed script:
  s         Substitute (unconditionally):
   /          Here's the start of a regular expression:
    .*          Any character any number of times, followed by
    <           an open broket
    \(          (remember this spot for backrefs)
    [^>]*       and any number of characters which are not close brokets 
    \)          (remember up to here for the first backref)
    >           and a closing broket
    .*          and any characters any number of times
   /          Awright, here's what to substitute that with:
    \1          The stuff that matched between the first set of \( \):s
                 (you start counting from the outside; in \(a\(b\)\),
                  you would have "ab" in \1 and "b" in \2)
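   /          End of the substitution expression
   g          Do this globally too (redundant here: the regexp swallows
               the whole line, so it can only match once)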
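
Here is a small sketch of the second expression, plus a one-liner
showing how nested groups are numbered. The sample input is made up
and extract_broketed is a hypothetical name for the illustration:

```shell
# Hypothetical helper: keep only what sits between < and >.
extract_broketed() {
    sed -e 's/.*<\([^>]*\)>.*/\1/g'
}

printf '%s\n' 'The Walrus <walrus@example.com>' | extract_broketed
# prints: walrus@example.com

# Group numbering: \1 is the outer group, \2 the inner one.
printf '%s\n' 'ab' | sed -e 's/\(a\(b\)\)/[\1][\2]/'
# prints: [ab][b]
```

A line without any brokets does not match the regexp at all, so sed
passes it through unchanged.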

This can and will break with elaborate parenthesized comments in the
From: header -- RFC822 permits quite complicated expressions, although
in practice you rarely see anything other than the following three
variants:
  From: address(_at_)host(_dot_)domain(_dot_)com
  From: address(_at_)host(_dot_)domain(_dot_)com (I am the Walrus)
  From: The Walrus <address(_at_)host(_dot_)domain(_dot_)com>
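
To check the combined command against those three variants without
involving formail, you can feed it the field contents directly. This
is a sketch with an invented address; bare_address is a hypothetical
name, and the input stands in for what formail -zxFrom: would print:

```shell
# Hypothetical helper: both substitutions from above, in one sed call.
bare_address() {
    sed -e 's/ *([^)]*) *//g' -e 's/.*<\([^>]*\)>.*/\1/g'
}

# One line per common From: variant (field contents only).
printf '%s\n' \
    'walrus@example.com' \
    'walrus@example.com (I am the Walrus)' \
    'The Walrus <walrus@example.com>' | bare_address
# prints walrus@example.com three times
```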

In the regexps, you should probably use [ <TAB>]* (a bracket
expression holding a space and a literal tab character) instead of
just spaces, but I left that out, out of laziness :-)
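
If typing a literal tab into a script is awkward, one way around it is
to let printf produce the tab and splice it into the expression. This
is a sketch only; the TAB variable and the strip_comments_ws name are
my own inventions for the illustration:

```shell
# Hold a literal tab so it can be used inside the bracket expression.
TAB=$(printf '\t')

# Hypothetical helper: like the first expression, but it also eats
# tabs (not just spaces) around the parenthesized comment.
strip_comments_ws() {
    sed -e "s/[ $TAB]*([^)]*)[ $TAB]*//g"
}

printf 'walrus@example.com\t(I am the Walrus)\n' | strip_comments_ws
# prints: walrus@example.com
```

The double quotes are needed so the shell expands $TAB before sed
sees the script.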

Hope this helps,

/* era */

-- 
Defin-i-t-e-ly. Sep-a-r-a-te. Gram-m-a-r.  <http://www.iki.fi/~era/>
 * Enjoy receiving spam? Register at <http://www.iki.fi/~era/spam.html>