Re: Extract only address from header content?

1997-09-05 13:46:51
Excerpts from mail: (05-Sep-97) Extract only address from header content? by 
Mitsuru Furukawa
The following recipe captures everything in the To: header content,
including, for example, "(Bassmkt)".

:0
* ^To:[       ]*\/[^  ].*
{
    TO_VALUE = $MATCH
}

How could I capture only the pure address part such as xxx@xxx.xx ,
xxx@xxx.xx.xx , @@@..ix.netcom.com , 
@.@ix.netcom.com , etc?

To do this correctly, you'd have to write an RFC822 mail address parser, which
is rather complicated to get right in any computer language, let alone in
procmail. However, for the limited purpose of fighting spam, we can probably
concern ourselves only with the three most popular forms of e-mail address:

"Name that includes special characters" <userid@host.domain.com>
Name that does not have any special characters <userid@host.domain.com>
userid@host.domain.com (Comment inside parentheses)

and a possible fourth case consisting of any combination of those three.

:0
* ^To:[         ]*\/[^  ].*
{
     TO_VALUE = $MATCH

     :0
     * ! TO_VALUE ?? ^^.*@.*,.*@
     {

          :0
          * TO_VALUE ?? ^^"[^"]+"[      ]+\/[^  ].*
          * MATCH ?? ^^([^<]+[  ]+)?<\/[^>]+
          { TO_VALUE = $MATCH }

          :0
          * TO_VALUE ?? ^^[^    ]+(@[^  ]+)?[   ]*\(.*\)[       ]*$
          * TO_VALUE ?? ^^\/[^  (]+
          { TO_VALUE = $MATCH }
     }
}

Note: This code is untested. Someone want to double-check my regexps?

Explanation: The first condition extracts the contents of the To: field, which
are assigned to $TO_VALUE as before. The next condition says not to try
parsing TO_VALUE if it has more than one e-mail address in it. (That's not a
perfect test. For example, an "@" and a comma might both appear in a comment
or inside quotes and not be part of the address, but this is probably good
enough for fighting spam.) The next condition compares the $TO_VALUE from the
previous condition to the first kind of RFC822 address, where the address
starts with a name enclosed in quotes (and the e-mail address is enclosed in
angle brackets). This condition will assign everything after the string
enclosed in double quotes (and some whitespace) to $MATCH. The next condition
line handles the case where the address starts with an optional name that is
not enclosed in double quotes and the real e-mail address is inside angle
brackets. It extracts the string inside the angle brackets into $MATCH, which
gets reassigned to $TO_VALUE. The next condition checks to see if $TO_VALUE
starts with an e-mail address and ends with a comment enclosed in
parentheses. If it does, the next line will extract the string from the
beginning of the field up to but not including the optional whitespace prior
to the open paren that starts the comment. The resulting $MATCH is reassigned
to $TO_VALUE.

Whew!

Note: This code is untested. Someone want to double-check me?
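
One thing a double-check turns up: in the recipe above, the quoted-name test
and the angle-bracket test are conditions of the same recipe, so both have to
match. A To: value of the second form (an unquoted name followed by the
address in angle brackets) would therefore pass through unchanged. A possible
variation is sketched below. It is equally untested, and it rests on an
assumption about the intent: take whatever sits inside the first pair of
angle brackets, and otherwise keep the leading run of characters up to the
first whitespace or open paren. Each [ 	] is meant to contain one space and
one tab.

```
:0
* ^To:[ 	]*\/[^ 	].*
{
     TO_VALUE=$MATCH

     # Leave TO_VALUE alone if it looks like a list of addresses.
     :0
     * ! TO_VALUE ?? ^^.*@.*,.*@
     {
          # Forms 1 and 2: optional name (quoted or not), then <address>.
          :0
          * TO_VALUE ?? ^^[^<]*<\/[^>]+
          { TO_VALUE=$MATCH }

          # Form 3: address followed by a (comment). If the recipe above
          # already ran, this just re-extracts the same address.
          :0
          * TO_VALUE ?? ^^\/[^ 	(]+
          { TO_VALUE=$MATCH }
     }
}
```

Like the original test, this is fooled by a "<" or "(" inside a quoted name,
so it is only a rough filter and not an RFC822 parser.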

P.S. Is there web-page-retrieval software which could run from
UNIX shell account? If it exists, please let me know off-the-list.

Get the GNU utility `wget' from <ftp://prep.ai.mit.edu/>, or use `lynx'. (But
`wget' is really worth having.)

Later,
Ed
