Re: procmail blues: half-shod solution...

At 13:05 2002-12-29 +0100, poff(_at_)sixbit(_dot_)org did say:

Oh dear - rather disappointed that my from tags are so horrible :(

I didn't call them horrible. The results however will be far less than satisfactory, and the process you're running needlessly consumes a LOT of CPU cycles.

If these messages are actually sent by their respective authors (versus through a mailing list), then the From_ line should contain their email address as the first component following the header, and procmail can extract that internally like so:


:0
* ^From[        ]+\/[^  ]+
{
        FROMADDR=$MATCH
}

This however will not match the desired address for messages which are redelivered through mailing lists (for instance, this message on the procmail list), even when the From: header shows the address, because the sender is the listowner, not the original author.

Interestingly, after I sent that email I figured out a way to cope with
addresses like yours, with name <address>


You should check that again -- mine is "address (name)" not "name <address>"

I test recipes within a "sandbox" - a testing environment that permits me to easily toss a recipe into a procmail framework and then redirect a saved mailbox at the recipes and see how they work with _real_ messages, but not inflicting operator oversight on my live mailstream. When dealing with recipes intended to parse some component from a regular message, it's a good idea to throw as large a cross-section of email at it as possible so as to identify "quirks". Running relative timing tests doesn't hurt either.
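
A sandbox run of that kind might look like the sketch below. It is only an illustration under stated assumptions: run_sandbox is a hypothetical helper name, and the mailbox and rcfile paths in the usage comment are placeholders. It relies on formail's -s option to split a saved mailbox into individual messages and on procmail's -m (mailfilter) mode to run each message through a test rcfile instead of the live one.

```sh
#!/bin/sh
# Hypothetical helper: feed every message in a saved mbox through a test rcfile.
# Usage: run_sandbox saved.mbox ./sandbox.rc
run_sandbox() {
    mbox=$1
    rcfile=$2

    # Refuse to run unless both files are readable.
    if [ ! -r "$mbox" ] || [ ! -r "$rcfile" ]; then
        echo "usage: run_sandbox MBOX RCFILE (both must be readable)" >&2
        return 2
    fi

    # formail -s splits the mailbox and starts one procmail per message;
    # procmail -m uses only the named rcfile.
    formail -s procmail -m "$rcfile" < "$mbox"
}
```

Prefixing the call with time(1) gives the kind of user/system/elapsed figures used for comparing recipes.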



# obtain the from _address_only_ - I explicitly remove the Reply-To in case it
# has been set (because I want to force using From:).  If a From: field
# doesn't exist, formail will resort to pulling the address from the From_
# line.
FROM=`formail -IReply-To: -rtzxTo:`

# username (I assume this was what your MUSER variable was supposed to be -
# in a review of the logs from an initial test though, your MUSER was almost
# ALWAYS empty).
:0
* FROM ?? ^\/[^@]+
{
        MUSER=$MATCH
}

# domain
:0
* FROM ?? @\/.*
{
        DOMAIN=$MATCH
}

(note that the above assumes the typical address specification of "user@domain", not _routed_ addresses, uucp addresses, or some of the cryptic addresses from days gone by).
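
For comparison, the same split can be sketched outside procmail with plain shell parameter expansion. This is an illustration only: split_address is a hypothetical helper, the address is a made-up example, and like the two recipes it assumes a bare user@domain address and splits at the first "@".

```sh
#!/bin/sh
# Hypothetical helper: split a bare "user@domain" address at the first "@".
# Sets MUSER and DOMAIN; returns 1 if the address contains no "@".
split_address() {
    addr=$1
    case $addr in
        *@*) ;;
        *)   echo "no @ in address: $addr" >&2
             return 1 ;;
    esac
    MUSER=${addr%%@*}    # everything before the first "@"
    DOMAIN=${addr#*@}    # everything after the first "@"
}

split_address "someone@example.com" && echo "$MUSER $DOMAIN"
# prints: someone example.com
```

Unlike the recipes, this sketch rejects an address with no "@" rather than leaving DOMAIN unset; that is a design choice for the example, not something procmail does.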

Additionally, the ORA regexp book has a perl program for checking addresses for RFC syntactic validity. I use it for verifying addresses as good enough for mailing, but I don't use it to parse an address out to just a raw address component - if you hack the perl a bit, it should net you a worthwhile function to scrub an address field to just a raw address component.

Now, with nothing but an email address in the FROM variable, there's no need for you to "clean up" the field using sed, or awk, which will dramatically improve the speed of the processing, as the following timings should demonstrate.

The numbers given here as well as the timing from yesterday's post should have the "base" timing factored out -- the overhead for the extraction of messages from the compressed archive, splitting through formail and the sandbox overhead comes to:


        34.55user 25.30system 1:35.78elapsed 62%CPU

You would subtract the above figures from the execution times to arrive at a rough figure for the actual comparative processing speed for the procmail recipes being evaluated (I've done that here).



Using the formail method to extract and scrub the From: field:

        54.84user 50.46system 2:12.69elapsed 79%CPU
        (minus the sandbox overhead, that's 36.91 elapsed)

Trusting that the From_ header contains the address you care to pay attention to (really only a matter with mailing lists), which negates the need to invoke formail:


        36.27user 31.26system 1:41.24elapsed 66%CPU

(minus the sandbox overhead, that's just 5.46 seconds elapsed to process the 1500+ messages in my recent procmail archive (which has grown nominally since yesterday <g>) -- that's 1.6% of the time it takes your original recipe to run -- that's 60.7 times faster - that's sort of like beefing up a lawnmower engine to power a full-sized car, with horsepower to spare).


Times from my post yesterday:

formail on EACH invocation:
        186.87 user 213.31 system 7:12.68 elapsed 92%CPU
        (adjusted: 336.90 seconds elapsed)

procmail extraction, then echo for each invocation:
        143.20 user 182.19 system 6:02.37 elapsed 89%CPU
        (adjusted: 266.59 seconds elapsed -- 26% faster than your original)
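
The adjusted figures quoted in this message can be re-derived by subtracting the base elapsed time (1:35.78) from each run's elapsed time. The sketch below does just that; to_seconds and adjusted are hypothetical helper names, and awk is used here only for one-off arithmetic.

```sh
#!/bin/sh
# Elapsed time of the sandbox overhead alone, as quoted in this message.
BASE="1:35.78"

# Convert an elapsed time in time(1)'s "M:SS.ss" form to seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN { split(t, p, ":"); printf "%.2f\n", p[1] * 60 + p[2] }'
}

# Print a run's elapsed time minus the base overhead, in seconds.
adjusted() {
    awk -v run="$(to_seconds "$1")" -v base="$(to_seconds "$BASE")" \
        'BEGIN { printf "%.2f\n", run - base }'
}

adjusted "2:12.69"   # formail, once per message        -> 36.91
adjusted "1:41.24"   # From_ line, no formail           -> 5.46
adjusted "7:12.68"   # formail on EACH invocation       -> 336.90
adjusted "6:02.37"   # procmail extraction, then echo   -> 266.59
```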

The big time suck with your recipe is in using AWK, which is abysmally slow. It may be versatile, but I use it only for onesey-twosie things, never for a frequent process (and even there, I'm more likely to churn out a perl script to do the task).

BTW, in my reply yesterday, I had "xFROM" as a variable name - this wasn't precisely a typo - I had edited something to store the results in a different variable name for comparison, and when I copied the text directly from the rcfile, I didn't edit it back.

If you'd like to continue this dialogue, please do me the courtesy of *NOT* quoting the entirety of my replies at the foot of your messages. If you're not responding to a specific portion, don't resend it.


---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail