procmail

Re: procmail blues: half-shod solution...

2002-12-29 14:04:55
At 13:05 2002-12-29 +0100, poff(_at_)sixbit(_dot_)org did say:
> Oh dear - rather disappointed that my from tags are so horrible :(

I didn't call them horrible. The results however will be far less than satisfactory, and the process you're running needlessly consumes a LOT of CPU cycles.

If these messages are actually sent by their respective authors (versus through a mailing list), then the From_ line should contain their email address as the first component following the header, and procmail can extract that internally like so:

:0
* ^From[        ]+\/[^  ]+
{
        FROMADDR=$MATCH
}

This however will not match the desired address for messages which are redelivered through mailing lists (for instance, this message on the procmail list), even when the From: header shows the address, because the sender is the listowner, not the original author.

> Interestingly, after I sent that email I figured out a way to cope with
> addresses like yours, with name <address>

You should check that again -- mine is "address (name)" not "name <address>"

I test recipes within a "sandbox" - a testing environment that lets me easily toss a recipe into a procmail framework, then redirect a saved mailbox at the recipes and see how they work with _real_ messages, without inflicting my own oversights on my live mailstream. When dealing with recipes intended to parse some component from a regular message, it's a good idea to throw as large a cross-section of email at them as possible so as to identify "quirks". Running relative timing tests doesn't hurt either.
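
For anyone wanting to set up something similar, a sandbox can be as small as the following sketch. The file names (test.rc, saved.mbox, /tmp/sandbox.log) are placeholders for illustration, not the setup used for the timings in this message.

```
# test.rc -- hypothetical sandbox rcfile; all paths are placeholders
DEFAULT=/dev/null          # discard anything the recipes under test don't deliver
LOGFILE=/tmp/sandbox.log   # inspect this after each run
VERBOSE=on                 # log each condition as it is evaluated

# ... recipes under test go here ...

# Feed a saved mbox through it, one procmail invocation per message:
#   formail -s procmail -m ./test.rc < saved.mbox
```

Prefixing that formail command with "time" yields figures that can be compared between recipes.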


# obtain the from _address_only_ - I explicitly remove the Reply-To in case it
# has been set (because I want to force using From:).  If a From: field
# doesn't exist, formail will resort to pulling the address from the From_
# line.
FROM=`formail -IReply-To: -rtzxTo:`

# username (I assume this was what your MUSER variable was supposed to be -
# in a review of the logs from an initial test though, your MUSER was almost
# ALWAYS empty).
:0
* FROM ?? ^\/[^@]+
{
        MUSER=$MATCH
}

# domain
:0
* FROM ?? @\/.*
{
        DOMAIN=$MATCH
}

(note that the above assumes the typical address form "user@domain" - not _routed_ addresses, uucp addresses, or some of the cryptic addresses from days gone by).
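
To make that assumption concrete, here is a rough shell equivalent of the two recipes, using POSIX parameter expansion rather than procmail's \/ match operator. It is only an illustration of what MUSER and DOMAIN end up holding; the sample address is made up.

```shell
# Hypothetical sample address, standing in for what formail returned.
FROM='someone@example.org'

MUSER=${FROM%%@*}    # everything before the first "@", like  ^\/[^@]+
DOMAIN=${FROM#*@}    # everything after the first "@",  like  @\/.*

echo "MUSER=$MUSER DOMAIN=$DOMAIN"   # prints: MUSER=someone DOMAIN=example.org
```

One difference worth knowing: if FROM holds no "@" at all, the shell expansions return the whole string, whereas the procmail domain recipe simply fails to match and leaves DOMAIN unset.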


Additionally, the ORA regexp book has a perl program for checking addresses for RFC syntactic validity. I use it for verifying addresses as good enough for mailing, but I don't use it to parse an address out to just a raw address component - if you hack the perl a bit, it should net you a worthwhile function to scrub an address field to just a raw address component.

Now, with nothing but an email address in the FROM variable, there's no need for you to "clean up" the field using sed, or awk, which will dramatically improve the speed of the processing, as the following timings should demonstrate.

The numbers given here, as well as the timings from yesterday's post, should have the "base" timing factored out -- the overhead for extracting the messages from the compressed archive, splitting them with formail, and running the sandbox comes to:

        34.55user 25.30system 1:35.78elapsed 62%CPU

You would subtract the above figures from the execution times to arrive at a rough figure for the actual comparative processing speed for the procmail recipes being evaluated (I've done that here).


Using the formail method to extract and scrub the From: field:

        54.84user 50.46system 2:12.69elapsed 79%CPU
        (minus the sandbox overhead, that's 36.91 seconds elapsed)
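
The adjustment is plain subtraction of elapsed times. As a sketch in shell (the function names to_cs and adjusted are invented for this example, and it assumes the "M:SS.cc" elapsed format shown above, with no hours field):

```shell
# to_cs: convert an elapsed time in "M:SS.cc" form to hundredths of a second.
to_cs() {
    min=${1%%:*}        # minutes
    rest=${1#*:}        # "SS.cc"
    sec=${rest%%.*}     # seconds, two digits
    cs=${rest#*.}       # hundredths, two digits
    # "1$sec - 100" keeps a field such as "08" from being parsed as octal.
    echo $(( $min * 6000 + (1$sec - 100) * 100 + (1$cs - 100) ))
}

# adjusted: elapsed time of a run minus the sandbox overhead, in seconds.
adjusted() {
    d=$(( $(to_cs "$1") - $(to_cs "$2") ))
    printf '%d.%02d\n' $(( d / 100 )) $(( d % 100 ))
}

adjusted 2:12.69 1:35.78    # formail method: prints 36.91
```

The same call with any of the other elapsed times in this message and the 1:35.78 overhead gives the corresponding adjusted figure.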

Trusting that the From_ header contains the address you care to pay attention to (really only a matter with mailing lists), which negates the need to invoke formail:

        36.27user 31.26system 1:41.24elapsed 66%CPU
        (minus the sandbox overhead, that's just 5.46 seconds elapsed to process the 1500+ messages in my recent procmail archive, which has grown nominally since yesterday <g>)

That's 1.6% of the time it takes your original recipe to run, or 60.7 times faster. It's sort of like beefing up a lawnmower engine to power a full-sized car, with horsepower to spare.

Times from my post yesterday:

formail on EACH invocation:
        186.87 user 213.31 system 7:12.68 elapsed 92%CPU
        (adjusted: 336.90 seconds elapsed)

procmail extraction, then echo for each invocation:
        143.20 user 182.19 system 6:02.37 elapsed 89%CPU
        (adjusted: 266.59 seconds elapsed -- 26% faster than your original)

The big time suck with your recipe is in using AWK, which is abysmally slow. It may be versatile, but I use it only for onesey-twosie things, never for a frequent process (and even there, I'm more likely to churn out a perl script to do the task).


BTW, in my reply yesterday, I had "xFROM" as a variable name - this wasn't precisely a typo - I had edited something to store the results in a different variable name for comparison, and when I copied the text directly from the rcfile, I didn't edit it back.


If you'd like to continue this dialogue, please do me the courtesy of *NOT* quoting the entirety of my replies at the foot of your messages. If you're not responding to a specific portion, don't resend it.

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.


_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail