At 13:05 2002-12-29 +0100, poff(_at_)sixbit(_dot_)org did say:
> Oh dear - rather disappointed that my from tags are so horrible :(
I didn't call them horrible. The results however will be far less than
satisfactory, and the process you're running needlessly consumes a LOT of
CPU cycles.
If these messages are actually sent by their respective authors (versus
through a mailing list), then the From_ line should contain their email
address as the first component following the header, and procmail can
extract that internally like so:
:0
* ^From[ ]+\/[^ ]+
{
FROMADDR=$MATCH
}
This however will not match the desired address for messages which are
redelivered through mailing lists (for instance, this message on the
procmail list), even when the From: header shows the address, because the
sender is the listowner, not the original author.
> Interestingly, after I sent that email I figured out a way to cope with
> addresses like yours, with name <address>
You should check that again -- mine is "address (name)" not "name <address>"
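
To make the difference between the two header styles concrete, here is a small sketch outside of procmail, using Python's standard email.utils.parseaddr. The name and address are made-up sample data, not taken from any real message, and this is only an illustration of the two forms, not part of the recipes discussed here.

```python
from email.utils import parseaddr

# Two common ways of writing the same From: value (sample data only)
comment_style = "user@example.com (Some Name)"   # "address (name)"
angle_style = "Some Name <user@example.com>"     # "name <address>"

for value in (comment_style, angle_style):
    # parseaddr returns a (display name, address) pair
    name, addr = parseaddr(value)
    print(f"{value!r}: name={name!r} addr={addr!r}")
```

A parser that only looks for angle brackets would find nothing to extract in the first form, which is why it pays to test against both.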
I test recipes within a "sandbox" - a testing environment that permits me
to easily toss a recipe into a procmail framework, then redirect a saved
mailbox at the recipes and see how they work with _real_ messages without
inflicting operator oversights on my live mailstream. When dealing with
recipes intended to parse some component from a regular message, it's a
good idea to throw as large a cross-section of email at it as possible so
as to identify "quirks". Running relative timing tests doesn't hurt either.
# obtain the from _address_only_ - I explicitly remove the Reply-To in case it
# has been set (because I want to force using From:). If a From: field
# doesn't exist, formail will resort to pulling the address from the From_
# line.
FROM=`formail -IReply-To: -rtzxTo:`
# username (I assume this was what your MUSER variable was supposed to be -
# in a review of the logs from an initial test though, your MUSER was almost
# ALWAYS empty).
:0
* FROM ?? ^\/[^@]+
{
MUSER=$MATCH
}
# domain
:0
* FROM ?? @\/.*
{
DOMAIN=$MATCH
}
(note that the above assumes typical address specification of "user@domain"
- not _routed_ addresses, uucp addresses, or some of the cryptic addresses
from days gone by).
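
For anyone who wants to sanity-check the splitting logic away from procmail, the same idea can be sketched in Python. This is only an approximation of what the two conditions do, under the same plain "user at domain" assumption; the function name split_address and the sample address are hypothetical.

```python
import re

def split_address(addr):
    """Split a plain address into (user, domain), roughly as the recipes do."""
    # Username: everything from the start up to (not including) the first "@"
    user = re.match(r"[^@]+", addr)
    # Domain: everything after the first "@"
    domain = re.search(r"@(.*)", addr)
    return (user.group(0) if user else None,
            domain.group(1) if domain else None)

print(split_address("someone@example.org"))
```

An address with no "@" at all yields a username but no domain, which mirrors the recipes: the domain condition simply fails to match and DOMAIN is never set.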
Additionally, the ORA regexp book has a perl program for checking addresses
for RFC syntactic validity. I use it for verifying addresses as good
enough for mailing, but I don't use it to parse an address out to just a
raw address component - if you hack the perl a bit, it should net you a
worthwhile function to scrub an address field to just a raw address component.
Now, with nothing but an email address in the FROM variable, there's no
need for you to "clean up" the field using sed, or awk, which will
dramatically improve the speed of the processing, as the following timings
should demonstrate.
The numbers given here as well as the timing from yesterday's post should
have the "base" timing factored out -- the overhead for the extraction of
messages from the compressed archive, splitting through formail and the
sandbox overhead comes to:
34.55user 25.30system 1:35.78elapsed 62%CPU
You would subtract the above figures from the execution times to arrive at
a rough figure for the actual comparative processing speed for the procmail
recipes being evaluated (I've done that here).
Using the formail method to extract and scrub the From: field:
54.84user 50.46system 2:12.69elapsed 79%CPU
(minus the sandbox overhead, that's 36.91 elapsed)
Trusting that the From_ header contains the address you care to pay
attention to (really only a matter with mailing lists), which negates the
need to invoke formail:
36.27user 31.26system 1:41.24elapsed 66%CPU
(minus the sandbox overhead, that's just 5.46 seconds elapsed to
process the 1500+ messages in my recent procmail archive (which has grown
nominally since yesterday <g>) -- that's 1.6% of the time it takes your
original recipe to run -- that's 60.7 times faster - that's sort of like
beefing up a lawnmower engine to power a full-sized car, with horsepower to
spare).
Times from my post yesterday:
formail on EACH invocation:
186.87 user 213.31 system 7:12.68 elapsed 92%CPU
(adjusted: 336.90 seconds elapsed)
procmail extraction, then echo for each invocation:
143.20 user 182.19 system 6:02.37 elapsed 89%CPU
(adjusted: 266.59 seconds elapsed -- 26% faster than your original)
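
The adjusted figures are plain subtraction of the sandbox overhead from each elapsed time. A short Python sketch of that arithmetic follows, using only the elapsed times quoted in this message; the labels and the helper name seconds are made up for illustration.

```python
def seconds(elapsed):
    """Convert a time(1)-style 'M:SS.ss' elapsed string to seconds."""
    minutes, secs = elapsed.split(":")
    return int(minutes) * 60 + float(secs)

BASE = seconds("1:35.78")  # sandbox overhead quoted in this message

# Elapsed times quoted in this message (labels are illustrative)
runs = {
    "formail, From: scrubbed": "2:12.69",
    "From_ line via procmail": "1:41.24",
    "original (formail each time)": "7:12.68",
    "procmail + echo": "6:02.37",
}

# Subtract the overhead to get the comparative processing time
adjusted = {label: round(seconds(t) - BASE, 2) for label, t in runs.items()}
for label, value in adjusted.items():
    print(f"{label}: {value:.2f}s")

# How the original recipe compares with the From_ line method
ratio = adjusted["original (formail each time)"] / adjusted["From_ line via procmail"]
print(f"original / From_ method = {ratio:.1f}x")
```

The ratio comes out at roughly 61.7 to 1, which matches the 1.6% figure (5.46 / 336.90) and is the same thing as being about 60.7 times faster.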
The big time suck with your recipe is in using AWK, which is abysmally
slow. It may be versatile, but I use it only for onesey-twosie things,
never for a frequent process (and even there, I'm more likely to churn out
a perl script to do the task).
BTW, in my reply yesterday, I had "xFROM" as a variable name - this wasn't
precisely a typo - I had edited something to store the results in a
different variable name for comparison, and when I copied the text directly
from the rcfile, I didn't edit it back.
If you'd like to continue this dialogue, please do me the courtesy of *NOT*
quoting the entirety of my replies at the foot of your messages. If you're
not responding to a specific portion, don't resend it.
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.
_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail