procmail

Re: CLEANTO anyone?

2004-05-30 03:21:28
At 20:11 2004-05-29 -0700, Jim Osborn wrote:
Before I reinvent a wheel, I thought I'd ask if anyone has a pointer
to a procmail recipe (using some clever MATCH magic perhaps or
maybe with a bit of additional sed) that would extract all the
address portions, minus comments, from a set of commented addresses?

I'm guessing you're using this CLEANTO variable name because you've seen code that performs CLEANFROM?

Making a COPY of the message and then diddling with the Reply-To: header (easier than setting the From: and then DELETING any Reply-To:) is one way to get formail to return a scrubbed address. Unfortunately, it reduces a list of addresses to just the first address in the list:

To: "Dave" <dave@example.com>, (Sue) sue@example.com

would, when manipulated into a Reply-To: and crammed through 'formail -rtzxTo:', result in "dave@example.com"

I'm afraid that unless someone else has a nifty invocation of formail or a (probably) elaborate procmail recursion, you're looking at writing your own external parser/scrubber (which of course is entirely doable - and passing just the HEADERS is a good way to minimize process overhead when you go to use it).
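
As a rough sketch of the front end of such an external helper, assuming it is handed only the header on stdin (for instance from a procmail recipe using the h flag), something like the following could pull out one field and unfold its continuation lines. The function name extract_field and the choice of awk are illustrative assumptions, not anything established in this thread.

```shell
#!/bin/sh
# extract_field NAME
# Hypothetical helper: reads a message (or just its header) on stdin and
# prints the unfolded contents of every NAME: field, joined with commas.
# Matching on the field name is case-insensitive.
extract_field() {
    awk -v want="$1" '
        BEGIN { want = tolower(want) ":" }
        /^$/ { exit }                  # blank line: the header is over
        /^[ \t]/ {                     # folded continuation line
            if (keep) buf = buf " " $0
            next
        }
        {
            # A new field starts here; remember whether it is the wanted one.
            keep = (tolower(substr($0, 1, length(want))) == want)
            if (keep) {
                val = substr($0, length(want) + 1)
                if (seen)
                    buf = buf "," val
                else
                    buf = val
                seen = 1
            }
        }
        END { print buf }
    '
}
```

The output still carries comments and whitespace; it is only meant to isolate the raw address string that a scrubber would then work on.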

:0
* 1^1 RAWTO ?? regex-that-ignores-comments

Refer to the REGEX book from ORA*. There's a perl regex in the appendix which quite completely parses email addresses. That perl code is included in the downloadable examples for the book (in fact, to date, it's the ONLY downloadable bit of code).

would also do the trick.  Counting, I'm sure, is much easier than
a full cleaning job, but if possible, it might be useful someday
to have the clean address set.

Counting commas AND @ signs together might be a more reliable way to count multiple recipients. I've seen @ inside name text not infrequently (including when people clone their address there, but often when it isn't even an address), and commas occasionally occur in comments as well.
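
A minimal shell sketch of counting both characters follows. The function names are made up for illustration, and taking the smaller of (commas + 1) and the @ count is an assumed heuristic rather than something tested on real mail; it will still be fooled when a single comment contains both a comma and an @.

```shell
#!/bin/sh
# count_char CHAR STRING: print how many times CHAR occurs in STRING.
count_char() {
    printf '%s' "$2" | tr -cd "$1" | wc -c | tr -d '[:space:]'
}

# estimate_recipients STRING
# Hypothetical heuristic: a stray comma inflates the comma-based count and
# a stray @ inflates the @-based count, so report the smaller of the two.
estimate_recipients() {
    by_comma=$(( $(count_char , "$1") + 1 ))
    by_at=$(count_char @ "$1")
    if [ "$by_at" -lt "$by_comma" ]; then
        echo "$by_at"
    else
        echo "$by_comma"
    fi
}
```

For example, estimate_recipients '"Doe, John" <jd@example.com>, sue@example.com' sees two commas and two @ signs and reports 2.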

You might also pipe to sed with a short expression to strip quoted bits (and angle brackets, etc around addresses). No, I don't have code for that, and you'd want to test it against a large corpus, but the following springs to mind (pipe the string you want scrubbed, basically the extracted to/cc header or whatever). The following sed command was executed at the shell (escaping as per the shell I happen to use), with the address string being piped at it:

echo "(Sally) <sally@host.example.com>, \"Joe Bob\" <joebob@host.example.com>, nobody@example.com \"nobody\", dick <dick@example.com>, <weed@example.com> dick weed, <doofus@example.com> doofus" | sed -e "s/\"[^\"]*\"\s*//g" -e 's/\(^\|,\)\([^,]\)*<\([^<>]*\)>\([^,]\)*/\1\3/g' -e 's/([^\(\)]*)//g' -e "s/<\([^<>]*\)>/\1/g" -e "s/[ ]//g"

(despite the plurality of sed expressions, that's _ONE_ invocation of the sed program)

I tried to include many of the common syntaxes.

You could also have name text with @ symbols in it.


Due to the primitive address cleanup, certain combinations of addresses and formatting might cause mixups. It is nearing 2am here, and I've got work to get done, or I'd gladly tinker further with this. You'll probably want to revise some of the regexps to deal with comments on EITHER the leading or the trailing side of the address token.

There's also the broad assumption that the headers are syntactically correct. If, say, you combine the contents of To: and Cc:

# unset the bugger
RECIPIENTS

:0
* ^To:\/.*
{
        RECIPIENTS=$MATCH
}

:0
* ^Cc:\/.*
{
        RECIPIENTS=$RECIPIENTS,$MATCH
}

then headers such as:

To: "Dave" <dave@example.com> (
Cc: (Sue) sue@example.com

could result in funky parsing because of the stray paren (though a quick check of this particular one shows it not to be a problem: the way I deal with <>-delimited addresses handily eliminates stray comments on EITHER side of the address).

Note that if the To: header is *EMPTY* (whether truly null or consisting only of whitespace), the Cc: recipe still inserts a comma before tacking on the Cc: contents. You can eliminate this extra cruft at the end of the sed (after whitespace removal):

        -e "s/\(^,\|,$\)//g"

and interim addresses that resolve to null:

        -e "s/,,/,/g"

(though as per the sed expressions, a quoted name in the ABSENCE of an associated address will end up returning an unquoted name rather than DELETING it).
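
Pulling the pieces together, here is one possible way to wrap the expressions above, including the two trailing cleanups, in a single shell function. This is a sketch under assumptions: the name scrub_addrs is invented, the patterns are restated in extended syntax (sed -E with POSIX character classes) because \s and \| in basic regexps are GNU extensions, and the comma collapse uses ,,* so that runs of three or more commas are also reduced. It has only been reasoned through against the sample strings in this message, not a real corpus.

```shell
#!/bin/sh
# scrub_addrs: hypothetical wrapper; reads an address list on stdin and
# prints the bare addresses separated by commas. Still ONE sed invocation.
# Expressions, in order:
#   1. drop "quoted" name text
#   2. where an <address> is present, keep only it and drop any text
#      on either side, up to the neighbouring commas
#   3. drop (parenthesised) comments
#   4. strip any remaining angle brackets
#   5. remove all whitespace
#   6. collapse runs of commas left by entries that resolved to nothing
#   7-8. trim a leading or trailing comma
scrub_addrs() {
    sed -E \
        -e 's/"[^"]*"[[:space:]]*//g' \
        -e 's/(^|,)[^,]*<([^<>]*)>[^,]*/\1\2/g' \
        -e 's/\([^()]*\)//g' \
        -e 's/<([^<>]*)>/\1/g' \
        -e 's/[[:space:]]//g' \
        -e 's/,,*/,/g' \
        -e 's/^,//' \
        -e 's/,$//'
}
```

Usage would be along the lines of: echo "$RECIPIENTS" | scrub_addrs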



On the sample mail I ran the sed operator against, it seemed to do what was expected of it. However, this isn't code I've been using, so it is up to you to further test and revise it. If you do discover problems, or make improvements, please provide details to the list.

As per a post which came through as I was composing this, it *DOES* clean:

        "<addr>" <addr>

suitably well. The part enclosed in double quotes is recognized as a comment and stripped out, even if what is inside the double quotes would otherwise appear to be an address.


* the book:

_Mastering Regular Expressions_, by Jeffrey E. F. Friedl
(Sebastopol, Calif: O'Reilly and Associates, 1997)
ISBN 1-56592-257-3
 http://www.oreilly.com/catalog/regex/


It's a must-have for the serious unix user. Grep, perl, sed, awk, procmail, php -- so many tools have regexp support.

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


