At 20:11 2004-05-29 -0700, Jim Osborn wrote:
> Before I reinvent a wheel, I thought I'd ask if anyone has a pointer
> to a procmail recipe (using some clever MATCH magic perhaps or
> maybe with a bit of additional sed) that would extract all the
> address portions, minus comments, from a set of commented addresses?
I'm guessing you're using this CLEANTO variable name because you've seen
code that performs CLEANFROM?
Making a COPY of the message and then diddling with the Reply-To: header
(easier than setting the From: and then DELETING any Reply-To:) is one way
to get formail to return a scrubbed address. Unfortunately, it reduces a
list of several addresses to just the first address in the list:
To: "Dave" <dave@example.com>, (Sue) sue@example.com
would, when manipulated into a Reply-To: and crammed through 'formail
-rtzxTo:', result in "dave@example.com"
I'm afraid that unless someone else has a nifty invocation of formail or a
(probably) elaborate procmail recursion, you're looking at writing your own
external parser/scrubber (which of course is entirely doable - and passing
just the HEADERS is a good way to minimize process overhead when you go to
use it).
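A minimal sketch of the header-reading side of such an external scrubber, assuming a POSIX sh and awk. The function name extract_recipients is hypothetical (it is not from any existing tool), and the sketch only gathers the raw To: and Cc: field bodies, joining folded continuation lines; stripping the comments is a separate step.

```shell
#!/bin/sh
# Hypothetical helper: read a mail header on stdin and print the To: and
# Cc: field bodies joined with commas. Continuation lines (starting with
# a space or tab) are appended to the field they continue.
extract_recipients() {
    awk '
        function save_field() {
            # Append the field collected so far, skipping empty ones.
            if (keep && val != "")
                out = (out == "" ? val : out "," val)
            keep = 0
            val = ""
        }
        /^$/     { exit }                 # blank line: end of the header
        /^[ \t]/ {                        # folded continuation line
            if (keep) {
                line = $0
                sub(/^[ \t]+/, "", line)
                val = val " " line
            }
            next
        }
                 { save_field() }         # any other line starts a new field
        tolower($0) ~ /^(to|cc):/ {
            keep = 1
            val = $0
            sub(/^[^:]*:[ \t]*/, "", val) # drop the "To:" / "Cc:" label
        }
        END { save_field(); print out }
    '
}
```

Fed from a procmail recipe carrying the h flag, it sees only the header, which keeps the per-message overhead down. An empty or whitespace-only field is skipped, so no stray comma is produced for it.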
> :0
> * 1^1 RAWTO ?? regex-that-ignores-comments
Refer to the REGEX book from ORA*. There's a perl regex in the appendix
which quite completely parses email addresses. That perl code is included
in the downloadable examples for the book (in fact, to date, it's the ONLY
downloadable bit of code).
> would also do the trick. Counting, I'm sure, is much easier than
> a full cleaning job, but if possible, it might be useful someday
> to have the clean address set.
Counting commas AND @ signs together might be a more reliable way to count
multiple recipients. I've seen @ inside name text not infrequently (incl.
when people clone their address there, but often when it isn't even an
address), not that commas don't occasionally occur in comments as well.
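One way the comma-plus-@ count could be sketched in shell, assuming a sed that accepts -E (GNU and BSD sed both do). The function name count_recipients is made up for illustration. Quoted strings and parenthesized comments are dropped first, so commas and @ signs inside them don't skew the count; what remains is split on commas and only the pieces containing an @ are counted.

```shell
#!/bin/sh
# Hypothetical helper: count the comma-separated tokens on stdin that
# still contain an @ once quoted strings and (comments) are removed.
count_recipients() {
    sed -E -e 's/"[^"]*"//g' -e 's/\([^()]*\)//g' |
        tr ',' '\n' |
        grep -c '@'
}

printf '%s\n' '"Smith, John" <js@example.com>, (Sue @ home) sue@example.com' |
    count_recipients
# prints: 2
```

A token with an @ in unquoted name text as well as in the address is still counted once, since grep -c counts lines rather than @ signs. Nested comments and escaped quotes are not handled.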
You might also pipe to sed with a short expression to strip quoted bits
(and angle brackets, etc around addresses). No, I don't have code for
that, and you'd want to test it against a large corpus, but the following
springs to mind (pipe the string you want scrubbed, basically the extracted
to/cc header or whatever). The following sed command was executed at the
shell (escaping as per the shell I happen to use), with the address string
being piped at it:
echo "(Sally) <sally@host.example.com>, \"Joe Bob\" <joebob@host.example.com>, nobody@example.com \"nobody\", dick <dick@example.com>, <weed@example.com> dick weed, <doofus@example.com> doofus" \
  | sed -e "s/\"[^\"]*\"\s*//g" \
        -e 's/\(^\|,\)\([^,]\)*<\([^<>]*\)>\([^,]\)*/\1\3/g' \
        -e 's/([^\(\)]*)//g' \
        -e "s/<\([^<>]*\)>/\1/g" \
        -e "s/[ ]//g"
(despite the plurality of sed expressions, that's _ONE_ invocation of the
sed program)
I tried to include many of the common syntaxes, but not all of them: you
could, for instance, have name text with @ symbols in it.
Due to the primitive address cleanup, certain combinations of addresses and
formatting might cause mixups. It is nearing 2am here, and I've got work
to get done, or I'd gladly tinker further with this. You'll probably want
to revise some of the regexps to deal with comments on EITHER the leading
or the trailing side of the address token.
There's also the broad assumption that the headers are syntactically
correct - if say, you combine the contents of To: and Cc:
# unset the bugger
RECIPIENTS

:0
* ^To:\/.*
{
        RECIPIENTS=$MATCH
}

:0
* ^Cc:\/.*
{
        RECIPIENTS=$RECIPIENTS,$MATCH
}
and the headers are:
To: "Dave" <dave@example.com> (
Cc: (Sue) sue@example.com
that could result in funky parsing because of the stray paren (though a quick
check of this particular case shows it not to be a problem - the way I deal
with <>-delimited addresses handily eliminates stray comments on EITHER
side of the address).
Note that if the To: header is *EMPTY* (whether null or not - it may
consist only of whitespace), the Cc: recipe will still insert a comma before
tacking on the Cc: contents. You can eliminate this extra cruft at the end
of the sed (after whitespace removal):
-e "s/\(^,\|,$\)//g"
and interim addresses that resolve to null:
-e "s/,,/,/g"
(though as per the sed expressions, a quoted name in the ABSENCE of an
associated address will end up returning an unquoted name rather than
DELETING it).
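Pulling the scrubbing and cleanup expressions together, a sketch of the whole thing as a shell function might look like this. Assumptions: the name scrub_addresses is hypothetical, sed -E (extended regexps, accepted by GNU and BSD sed) is used in place of the GNU-only \| and \s, and runs of commas are collapsed with ,,* because a plain s/,,/,/g leaves a doubled comma behind when three are adjacent. As with the original expressions, it wants testing against a large corpus.

```shell
#!/bin/sh
# Hypothetical helper: reduce a comma-separated address list on stdin to
# bare addresses. A sketch only; nested comments, escaped quotes and
# group syntax are not handled.
scrub_addresses() {
    sed -E \
        -e 's/"[^"]*"[[:space:]]*//g' \
        -e 's/(^|,)[^,]*<([^<>]*)>[^,]*/\1\2/g' \
        -e 's/\([^()]*\)//g' \
        -e 's/<([^<>]*)>/\1/g' \
        -e 's/[[:space:]]//g' \
        -e 's/,,*/,/g' \
        -e 's/^,//' \
        -e 's/,$//'
}
# In order: drop quoted names; keep only the <address> of any item that
# has one; drop (comments); unwrap leftover angle brackets; remove
# whitespace; collapse comma runs; trim a leading or trailing comma.

printf '%s\n' '(Sally) <sally@host.example.com>, "Joe Bob" <joebob@host.example.com>' |
    scrub_addresses
# prints: sally@host.example.com,joebob@host.example.com
```

The comma collapse runs before the leading and trailing trim so that an empty To: followed by an empty first Cc: entry doesn't leave a comma at the front.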
On the sample mail I ran the sed operator against, it seemed to do what was
expected of it. However, this isn't code I've been using, so it is up to
you to further test and revise it. If you do discover problems, or make
improvements, please provide details to the list.
As per a post which came through as I was composing this, it *DOES* clean:
"<addr>" <addr>
suitably well. The part enclosed in doublequotes is recognized to be a
comment and stripped out, even if what is inside the double quotes would
otherwise appear to be an address.
* the book:
_Mastering Regular Expressions_, by Jeffrey E. F. Friedl
(Sebastopol, Calif: O'Reilly and Associates, 1997)
ISBN 1-56592-257-3
http://www.oreilly.com/catalog/regex/
It's a must-have for the serious unix user. Grep, perl, sed, awk,
procmail, php -- so many tools have regexp support.
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.