Re: CLEANTO anyone?

At 20:11 2004-05-29 -0700, Jim Osborn wrote:

> Before I reinvent a wheel, I thought I'd ask if anyone has a pointer
> to a procmail recipe (using some clever MATCH magic perhaps or
> maybe with a bit of additional sed) that would extract all the
> address portions, minus comments, from a set of commented addresses?

I'm guessing you're using this CLEANTO variable name because you've seen code that performs CLEANFROM?

Making a COPY of the message and then diddling with the Reply-To: header (easier than setting the From: and then DELETING any Reply-To:) is one way to get formail to return a scrubbed address. Unfortunately, it parses a list of several addresses down to just the first address in the list:


To: "Dave" <dave(_at_)example(_dot_)com>, (Sue) sue(_at_)example(_dot_)com

would, when manipulated into a Reply-To: and crammed through 'formail -rtzxTo:', result in "dave(_at_)example(_dot_)com"

I'm afraid that unless someone else has a nifty invocation of formail or a (probably) elaborate procmail recursion, you're looking at writing your own external parser/scrubber (which of course is entirely doable - and passing just the HEADERS is a good way to minimize process overhead when you go to use it).

> :0
> * 1^1 RAWTO ?? regex-that-ignores-comments

Refer to the REGEX book from ORA*. There's a perl regex in the appendix which quite completely parses email addresses. That perl code is included in the downloadable examples for the book (in fact, to date, it's the ONLY downloadable bit of code).

> would also do the trick.  Counting, I'm sure, is much easier than
> a full cleaning job, but if possible, it might be useful someday
> to have the clean address set.

Counting commas AND @ might be a more reliable way to count multiple recipients than counting either alone. I've seen @ inside name text not infrequently (incl. when people clone their address there, but often when it isn't even an address), not that commas don't occasionally occur in comments as well.
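
A minimal shell sketch of that counting idea follows. The helper name count_char is made up for the illustration (it is not part of procmail or formail), and the sample string is invented. It only reports the two raw counts; how to weigh them against each other is left open, since either character can also turn up inside comments.

```shell
# count_char CHAR STRING: print how many times CHAR occurs in STRING.
# Hypothetical helper for illustration only.
count_char() {
    # tr -cd deletes every character EXCEPT the one given,
    # wc -c counts what is left, and the last tr strips any
    # padding some wc implementations put before the number.
    printf '%s' "$2" | tr -cd "$1" | wc -c | tr -d ' '
}

recipients='"Dave" <dave@example.com>, (Sue) sue@example.com'
echo "commas: $(count_char , "$recipients")"
echo "ats:    $(count_char @ "$recipients")"
```

For the sample string this reports one comma and two @ signs, which agrees with two recipients. A string where the counts disagree is a hint that a comment or quoted name contains one of the characters.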

You might also pipe to sed with a short expression to strip quoted bits (and angle brackets, etc around addresses). No, I don't have code for that, and you'd want to test it against a large corpus, but the following springs to mind (pipe the string you want scrubbed, basically the extracted to/cc header or whatever). The following sed command was executed at the shell (escaping as per the shell I happen to use), with the address string being piped at it:

echo "(Sally) <sally(_at_)host(_dot_)example(_dot_)com>, \"Joe Bob\"<joebob(_at_)host(_dot_)example(_dot_)com>, nobody(_at_)example(_dot_)com \"nobody\", dick<dick(_at_)example(_dot_)com>, <weed(_at_)example(_dot_)com> dick weed, <doofus(_at_)example(_dot_)com>doofus" \
 | sed -e "s/\"[^\"]*\"\s*//g" \
       -e 's/\(^\|,\)\([^,]\)*<\([^<>]*\)>\([^,]\)*/\1\3/g' \
       -e 's/([^\(\)]*)//g' \
       -e "s/<\([^<>]*\)>/\1/g" \
       -e "s/[ ]//g"

(despite the plurality of sed expressions, that's _ONE_ invocation of the sed program)


I tried to include many of the common syntaxes.

You could have name text with @ symbols in them.

Due to the primitive address cleanup, certain combinations of addresses and formatting might cause mixups. It is nearing 2am here, and I've got work to get done, or I'd gladly tinker further with this. You'll probably want to revise some of the regexps to deal with comments on EITHER the leading or the trailing side of the address token.

There's also the broad assumption that the headers are syntactically correct - if say, you combine the contents of To: and Cc:


# unset the bugger
RECIPIENTS

:0
* ^To:\/.*
{
        RECIPIENTS=$MATCH
}

:0
* ^Cc:\/.*
{
        RECIPIENTS=$RECIPIENTS,$MATCH
}

if:

To: "Dave" <dave(_at_)example(_dot_)com> (
Cc: (Sue) sue(_at_)example(_dot_)com

This could result in funky parsing because of the stray paren (though a quick check of this particular one shows it not to be a problem - the way I deal with <> delimited addresses handily eliminates stray comments on EITHER side of the address).

Note that if the To: header is *EMPTY* (whether null or not - it may consist of only whitespace), and yet the Cc: recipe goes on to insert a comma before tacking on the Cc: contents, you can eliminate this extra cruft at the end of the sed (after whitespace removal):


        -e "s/\(^,\|,$\)//g"

and interim addresses that resolve to null:

        -e "s/,,/,/g"

(though as per the sed expressions, a quoted name in the ABSENCE of an associated address will end up returning an unquoted name rather than DELETING it).

On the sample mail I ran the sed operator against, it seemed to do what was expected of it. However, this isn't code I've been using, so it is up to you to further test and revise it. If you do discover problems, or make improvements, please provide details to the list.
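
Pulling the expressions together, the whole scrub can be sketched as one shell function. The name clean_addrs is made up for the illustration. The expressions are restated with [[:space:]] and with two separately anchored substitutions in place of \s and \|, which are GNU sed extensions; treating that restatement as equivalent is an assumption. Comma runs are collapsed with ",,*" rather than ",," so that three or more in a row also reduce to one. This is a sketch worked through on a handful of sample strings, not tested code, so run it against a real corpus before trusting it.

```shell
# clean_addrs: read a To:/Cc: style address list on stdin and print
# the bare addresses, comma separated. Hypothetical helper name.
clean_addrs() {
    sed -e 's/"[^"]*"[[:space:]]*//g' \
        -e 's/^[^,]*<\([^<>]*\)>[^,]*/\1/' \
        -e 's/,[^,]*<\([^<>]*\)>[^,]*/,\1/g' \
        -e 's/([^()]*)//g' \
        -e 's/<\([^<>]*\)>/\1/g' \
        -e 's/[[:space:]]//g' \
        -e 's/,,*/,/g' \
        -e 's/^,//' \
        -e 's/,$//'
}
# Expression order: drop quoted names; keep only the <addr> part of
# the first entry; same for each later entry; drop (comments); strip
# leftover angle brackets; remove whitespace; collapse comma runs;
# trim a leading and a trailing comma.

printf '%s\n' '"Dave" <dave@example.com> (,(Sue) sue@example.com' | clean_addrs
```

The sample line is the To: with the stray paren joined to the Cc: by a comma, written with real @ signs, and it comes out as dave@example.com,sue@example.com. Note that an address list taken from an archive that rewrites @ as "(_at_)" must be restored first, because the comment expression would otherwise strip the "(_at_)" as if it were a comment.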


As per a post which came through as I was composing this, it *DOES* clean:

        "<addr>" <addr>

suitably well. The part enclosed in doublequotes is recognized to be a comment and stripped out, even if what is inside the double quotes would otherwise appear to be an address.



* the book:

_Mastering Regular Expressions_, by Jeffrey E. F. Friedl
(Sebastopol, Calif: O'Reilly and Associates, 1997)
ISBN 1-56592-257-3
 http://www.oreilly.com/catalog/regex/

It's a must-have for the serious unix user. Grep, perl, sed, awk, procmail, php -- so many tools have regexp support.


---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail