procmail

Re: stripping email addresses?

1999-11-12 02:23:15
On Thu, 11 Nov 1999 23:50:16 +0000 (GMT), James Stevenson
<mistral(_at_)stevenson(_dot_)zetnet(_dot_)co(_dot_)uk> wrote:
> would it be possible to strip email addresss from emails a recived
> i want to put them into an address book so that i can always find
> the person by there name i keep losing peoples email address

Not really a function of Procmail, although perhaps it would be
convenient to run a script from within Procmail when you receive mail
from somebody who is not already in your database.

A simple Perl script is probably all you need to actually parse the
headers of a message looking for sender identification information.
Unfortunately, the format of RFC822 headers is needlessly complex, so
there's no way you can get away with a simple one-liner.

Here's some of the stuff you will have to cope with:

  * There are lots of different headers which can contain the
    information you're looking for. Perhaps you should look in them
    all, or perhaps you only want to select a few of them, based on
    whatever else is there (e.g. don't look at Sender: if there is
    also a From:, etc.)

        Resent-From:
        Resent-Sender:
        From:
        Sender:
        Reply-To:
         etc

  * The format of those headers is not strictly defined. There's bound
    to be an email address in there somewhere, unless the message was
    forged, of course, but there is not guaranteed to be a real name
    anywhere. The following are widely seen:

        From: era eriksson <era(_at_)iki(_dot_)fi>
        From: era(_at_)iki(_dot_)fi
        From: era(_at_)iki(_dot_)fi (era eriksson)

    and the following are entirely possible:

        From: era(_at_)iki(_dot_)fi (this is a comment with no useful content)
        From: "'era(_at_)iki(_dot_)fi'" <era(_at_)iki(_dot_)fi>    ; Microsoft braindeath
        From: era (thass me!) eriksson <era(_at_)iki(_dot_)fi> (fer sure)

    You can probably get half-decent results by simply ignoring these
    things. People don't tend to change their settings very often once
    they have them set up, although they might of course be using
    slightly different settings on different accounts or different
    physical machines. You can have many "real names" corresponding to
    one address, and many addresses corresponding to one "real name".
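
To make the "widely seen" formats above concrete, here is a minimal sketch that splits a header body into a real name and an address. The function name split_from and the example.org addresses are made up for illustration, and this is nowhere near a full RFC822 parser: it only knows the three common layouts and treats everything else as a bare address.

```shell
#!/bin/sh
# Hypothetical helper, not a complete RFC822 address parser.
# Reads From:-style header bodies on stdin, one per line, and prints
# "real name<TAB>address"; the name is empty when none was found.
split_from() {
    tab=$(printf '\t')
    # First sed trims surrounding whitespace. It is a separate process so
    # that its substitutions do not trigger the "t" branches below.
    sed -e "s/^[ $tab]*//" -e "s/[ $tab]*\$//" |
    sed -e "s/^\(.*[^ $tab]\)[ $tab]*<\([^<>]*\)>\$/\1$tab\2/" -e t \
        -e "s/^<\([^<>]*\)>\$/$tab\1/" -e t \
        -e "s/^\([^ $tab()]*\)[ $tab]*(\(.*\))\$/\2$tab\1/" -e t \
        -e "s/^/$tab/"
}
```

Feeding it "Era Example <era@example.org>", "era@example.org (Era Example)" and plain "era@example.org" should give the same address in the second field each time, with an empty name in the last case. The uglier variants shown above will still come out wrong (a comment with no useful content ends up as the "name", for instance), so treat the output as a first approximation.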

With all that out of the way, try this on a mailbox full of messages:

    formail -cxFrom: -xSender: -xResent-From: -xResent-Sender: -s < mailbox 

Here's what I get out of my Procmail mailbox:

 $ formail -czxFrom: -xSender: -xResent-From: -xResent-Sender: -s < procmail |
sort -u
 "Gerhard Landauf" <landauf(_at_)teleweb(_dot_)at>
 Angelina Paunovic <angelina(_at_)nccdns(_dot_)moc(_dot_)kw>
 David Gikandi <myprocmail(_at_)yahoo(_dot_)com>
 Frederic Trudeau <thangel(_at_)CAM(_dot_)ORG>
 James Stevenson <mistral(_at_)stevenson(_dot_)zetnet(_dot_)co(_dot_)uk>
 Kajetan Beler <kajetan(_at_)nvg(_dot_)ntnu(_dot_)no>
 Venky <venky(_at_)del2(_dot_)vsnl(_dot_)net(_dot_)in>
 era eriksson <era(_at_)iki(_dot_)fi>
 era eriksson <era+i(_at_)iki(_dot_)fi>
 procmail-request(_at_)informatik(_dot_)rwth-aachen(_dot_)de
 procmail(_at_)informatik(_dot_)rwth-aachen(_dot_)de
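
The formail command above pulls out all four headers from every message. If you would rather take just one header per message (the "don't look at Sender: if there is also a From:" idea mentioned earlier), something along these lines might do. The function name pick_sender and the priority order are my own assumptions; rearrange the list to taste.

```shell
#!/bin/sh
# Hypothetical helper: read ONE message on stdin and print the body of
# the first header that exists, trying From:, Sender:, Resent-From:,
# Resent-Sender: in that (assumed) order of preference.
pick_sender() {
    awk '
        /^$/     { exit }                 # blank line ends the headers
        /^[ \t]/ {                        # folded continuation line
            if (cur != "") val[cur] = val[cur] $0
            next
        }
        {
            cur = ""
            colon = index($0, ":")
            if (colon == 0) next          # not a header line
            name = tolower(substr($0, 1, colon - 1))
            if (name in val) next         # keep the first occurrence only
            body = substr($0, colon + 1)
            sub(/^[ \t]+/, "", body)
            val[name] = body
            cur = name
        }
        END {
            n = split("from sender resent-from resent-sender", order, " ")
            for (k = 1; k <= n; k++)
                if (order[k] in val) { print val[order[k]]; exit }
        }
    '
}
```

It only looks at the header block, so a line starting with "From:" in the body of the message is ignored. Folded headers are joined back together with their whitespace intact.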

You could rely on the angle brackets and create a file of mappings,
tab separated, of all Real Name to <brackets> pairs:

 $ tab=$(printf '\t')   # a real tab, kept in a variable so it survives copy and paste
 $ formail -czxFrom: -xSender: -xResent-From: -xResent-Sender: -s < procmail |
sort -u |
# Get rid of any tabs
tr '\011' ' ' |
# Get rid of any lines without a <broket> pair, reformat the remaining lines
sed -n \
 "s/\(.*[^ $tab]\)[ $tab]*<\([^<>@ $tab]*@[^<>@ $tab]*\)>/\1$tab\2/p"
 "Gerhard Landauf"      landauf(_at_)teleweb(_dot_)at
 Angelina Paunovic      angelina(_at_)nccdns(_dot_)moc(_dot_)kw
 David Gikandi  myprocmail(_at_)yahoo(_dot_)com
 Frederic Trudeau       thangel(_at_)CAM(_dot_)ORG
 James Stevenson        mistral(_at_)stevenson(_dot_)zetnet(_dot_)co(_dot_)uk
 Kajetan Beler  kajetan(_at_)nvg(_dot_)ntnu(_dot_)no
 Venky  venky(_at_)del2(_dot_)vsnl(_dot_)net(_dot_)in
 era eriksson   era(_at_)iki(_dot_)fi
 era eriksson   era+i(_at_)iki(_dot_)fi

Now you can store this to a file, and create a small program which
greps entries out of this file.
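
If you later want to keep that file up to date as new mail arrives (for example from a script run from within Procmail, as suggested at the top), a sketch like the following could append only senders you have not seen before. The function name add_if_new is hypothetical, and comparing on the address alone, ignoring case, is my assumption about what "already in your database" should mean.

```shell
#!/bin/sh
# Hypothetical helper: add_if_new FILE NAME ADDRESS
# Appends "NAME<TAB>ADDRESS" to FILE unless ADDRESS (compared without
# regard to case) is already in the second field of some line.
add_if_new() {
    file=$1
    name=$2
    addr=$3
    tab=$(printf '\t')
    touch "$file"                         # make sure the file exists
    if awk -F "$tab" -v a="$addr" '
           tolower($2) == tolower(a) { found = 1 }
           END { exit !found }
       ' "$file"
    then
        return 0                          # already known, nothing to do
    fi
    printf '%s\t%s\n' "$name" "$addr" >>"$file"
}
```

There is no locking here, so two deliveries arriving at the same moment could both append the same address. Running sort -u over the file now and then would clean that up.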

(The patterns below need real tab characters, and you probably can't
copy and paste those from one window to another because the tabs will
be turned into spaces. To get around that, the tab is kept in a shell
variable here. If you still can't figure out how to handle the tabs,
mail me back in private and I'll send you these scripts as
attachments.)

 $ tab=$(printf '\t')
 $ formail -czxFrom: -xSender: -xResent-From: -xResent-Sender: -s < procmail |
sort -u |
# Get rid of any tabs
tr '\011' ' ' |
# Get rid of any lines without a <broket> pair, reformat the remaining lines
sed -n \
 "s/\(.*[^ $tab]\)[ $tab]*<\([^<>@ $tab]*@[^<>@ $tab]*\)>/\1$tab\2/p" \
  >"$HOME/addies"

 $ cat >$HOME/bin/addies <<'HERE'
#!/bin/sh
# yeah, I always create programs with cat like this :-)  (not)
# Look for $1 in the name field only, i.e. before the first tab
tab=$(printf '\t')
grep -i "^[^$tab]*$1[^$tab]*$tab" "$HOME/addies"
HERE

 $ chmod +x $HOME/bin/addies

 $ addies stevenson
 James Stevenson        mistral(_at_)stevenson(_dot_)zetnet(_dot_)co(_dot_)uk

I imagine something like this would be what you were looking for? You
will have to modify and enhance it for your personal needs, of course.
If you want it to cover addresses other than ones following the Real
Name <address> convention, you will have to modify the sed script, for
example. And perhaps you would like grep to only look for entire words
(grep will happily report "System Operator" as matching "era" because
it contains the three letters e-r-a), or to also look at the address
field (I wrote addies so it doesn't look beyond the first tab for a
match). (Also, the sed script will leave in anything after the closing
broket. But maybe that's a feature.)
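
As one possible take on the "entire words" and "also look at the address field" enhancements, here is a rough sketch using awk instead of grep. The function name lookup_word is hypothetical, and so is the definition of a "word": any run of ASCII letters and digits, which means names with accented characters will be split in odd places.

```shell
#!/bin/sh
# Hypothetical helper: lookup_word FILE WORD
# Prints the lines of FILE in which WORD occurs as a whole word, in
# either the name or the address field, ignoring case.
lookup_word() {
    awk -v w="$2" '
        BEGIN { w = tolower(w) }
        {
            # Split the whole line (name and address) into words
            n = split(tolower($0), words, /[^a-z0-9]+/)
            for (i = 1; i <= n; i++)
                if (words[i] == w) { print; next }
        }
    ' "$1"
}
```

With this, looking up "era" no longer reports "System Operator", and it will not report somebody called Vera either, but it does find an entry whose address is era at some host even if the real name is something else entirely.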

If you are not familiar with the tools used here, I would recommend
reading a Unix book. One of my favorites is "The Unix Programming
Environment" by Kernighan and Pike. It contains a lot of examples and
exercises something like the above. It's from 1984 so some of the
material needs to be taken with a grain of salt (how the terminal
works, for example; also, the material on how the entire directory
structure is organized on typical Unix systems is a bit dated) but
it's still a very good book, and I would imagine most University
libraries would have at least one copy.

Hope this helps,

/* era */

-- 
 Too much to say to fit into this .signature anyway: <http://www.iki.fi/era/>
  Fight spam in Europe: <http://www.euro.cauce.org/> * Sign the EU petition
