procmail

Re: Procmail experiments -- good methods

1999-05-17 06:49:49
On 16 May 1999 18:03:06 -0700, Harry Putnam <reader(_at_)newsguy(_dot_)com>
wrote:
> era eriksson <era(_at_)iki(_dot_)fi> writes:
>> Since the significance of the date on the From_ line is pretty
>> much zero here (if it isn't, how about you derive it from the
>> Date: header of the digested message [a very imprecise science]
>> instead of the time when you happened to process it?), you might
> Tell me more about deriving the date from the Date: header if you
> have time.

There have been some threads about that from time to time. There is a
reference to a thread on this list in conjunction with the pointer to
the `mdate' program on <http://www.iki.fi/era/procmail/links.html#mdate>
... apart from that, I remember some talk about this in comp.mail.misc
a long time ago, and there was a brief thread on the spamtools list (I
think it was about where to find the specs for what exactly is
allowed) in about December--January or so.

Basically, the bottom line is that date formats vary a lot, and it's
not always possible to get a valid date from a Date: header.
(Earlier this year, I had a script that checked Date: headers on
Usenet messages for a few weeks. One problem I stumbled over was that
some newsreaders put in the name of the month in a language you might
not be able to guess, which makes the date impossible to map back to a
canonical form. [I was able to deduce that some of them were in
KOI8-R Cyrillic and a lot of others were from hosts in Korea, not to
mention German and Finnish which I happen to be able to read, but you
might not.] I'd be surprised if there were not mail clients that did
this, too.)
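
For what it's worth, here is a rough sketch of the kind of check I
mean. It is not the script I actually ran; the function names
date_header and has_english_month and the sample message are made up
for illustration. It pulls the first Date: header out of the header
block and merely tests whether an English month abbreviation is in
there, assuming the usual "16 May 1999" layout. Folded headers and
other date layouts are not handled.

```shell
# Print the value of the first Date: header found before the blank
# line that ends the headers. (Hypothetical helper, for illustration.)
date_header() {
    sed -n -e '/^$/q' -e '/^Date:/!d' -e 's/^Date: *//p' -e q
}

# Succeed if the argument contains an English month abbreviation
# as a separate word. (Hypothetical helper, for illustration.)
has_english_month() {
    case " $1 " in
        *" Jan "*|*" Feb "*|*" Mar "*|*" Apr "*|*" May "*|*" Jun "*) return 0 ;;
        *" Jul "*|*" Aug "*|*" Sep "*|*" Oct "*|*" Nov "*|*" Dec "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Made-up sample message.
sample='From foo@bar Mon May 17 06:49:49 1999
Date: Sun, 16 May 1999 18:03:06 -0700
Subject: sample

body'

d=$(printf '%s\n' "$sample" | date_header)
if has_english_month "$d"; then
    echo "usable: $d"
else
    echo "cannot map back to a canonical date: $d"
fi
```

Anything which fails the test would have to fall back to the time of
processing, or to whatever other heuristic you can come up with.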

On 16 May 1999 18:29:03 -0700, Harry Putnam <reader(_at_)newsguy(_dot_)com>
wrote:
> "David W. Tamkin" <dattier(_at_)Mcs(_dot_)Net> writes:
>> Now, where did I leave off?  Harry had this recipe,
> One thing I haven't found a reference to is the ` -ep' in your script.
> What is its function?
>> -e '/END--------------cut here-------------/q' -ep \

The -e introduces one more piece of the sed script, and the p is
sed's "print" command. The script in its entirety, repeated here
slightly simplified for your convenience:

    sed -n -e '1,/xxx/d' -e '/yyy/q' -ep

reads like this in human terms:

    -n          Don't copy input to output unless instructed otherwise
    1,/xxx/d    From first line (1) through to the first match on
                the regular expression xxx, delete input and loop back
                to the start of this script
    /yyy/q      On (the first) occurrence of the regular expression yyy,
                close all file descriptors and quit. (This is better
                than reading and deleting every remaining line of
                input; sed will actually quit and leave the rest of
                the input "dangling". There might in the general case
                be a lot of it.)
    p           (For any input line which makes it through to this
                point,) copy input to output.
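
If you want to watch it work, something like the following should do.
The sample text and the function name extract_body are made up for
illustration, and xxx and yyy stand in for the real cut lines. I have
written -e p rather than -ep; the two mean the same thing.

```shell
# Made-up sample input; xxx and yyy stand in for the cut markers.
sample='header one
header two
xxx
body one
body two
yyy
trailer'

# Hypothetical wrapper around the simplified script shown above.
extract_body() {
    sed -n -e '1,/xxx/d' -e '/yyy/q' -e p
}

body=$(printf '%s\n' "$sample" | extract_body)
printf '%s\n' "$body"
```

Only the two "body" lines should come out: everything up to and
including xxx is deleted, and sed quits at yyy without printing it
because of the -n.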

The talk of "copying" "input" to "output" is a simplification; sed
actually works in terms of a "pattern space" which you can modify, copy
to a "hold space", print, or otherwise process.
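
A small illustration of the two buffers, in case it helps: sed's own
documentation calls the working buffer the pattern space and the spare
one the hold space. The old trick below prints its input in reverse
line order by accumulating everything in the hold space. The function
name reverse_lines is made up, and this has nothing to do with the
digest recipe as such.

```shell
# Hypothetical helper: print the lines of stdin in reverse order.
reverse_lines() {
    # 1!G  on every line but the first, append the hold space to the
    #      pattern space (so the older lines end up after the new one)
    # h    copy the pattern space over to the hold space
    # $p   on the last line, print the pattern space
    sed -n -e '1!G' -e h -e '$p'
}

reversed=$(printf 'one\ntwo\nthree\n' | reverse_lines)
printf '%s\n' "$reversed"
```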

> The sed script called here looks like:
>     3,/BEGIN.*cut here/d
>     /END.*cut here/, $d
> Solves the header stuff with less fanfare. Not sure what the $d does
> but it doesn't work without it.

(The observation that you can keep the From_ line and Return-Path: by
keeping lines 1 and 2, and deleting from line 3 onwards only, is a
good one!)

    /zzz/,$d    From the first line matching the regular expression
                zzz through to end of file ($), delete (d) the input
                line and fetch the next one and loop back to the start
                of the script. (You can probably see how it would,
                generally speaking, be more efficient to simply quit
                the script when you hit zzz. You have to restructure
                the script a little bit in order for this to work,
                along the lines shown by David's example, i.e. you
                don't want sed to print all lines by default, so you
                call it up with an -n option and put in a clause to
                print all input lines which make it to the end of the
                script.)
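
To convince yourself that the two styles produce the same output, here
is a sketch with a made-up message. The function names delete_form and
quit_form and the sample text are my invention; quit_form is simply
David's arrangement with the range starting at line 3 instead of
line 1.

```shell
# Made-up digest-like sample; only the cut markers matter.
msg='From foo@bar Mon May 17 06:49:49 1999
Return-Path: <foo@bar>
Subject: digest wrapper
----BEGIN cut here----
Date: Sun, 16 May 1999 18:03:06 -0700

body text
----END cut here----
trailing junk'

# Delete from the END marker through to end of file.
delete_form() {
    sed -e '3,/BEGIN.*cut here/d' -e '/END.*cut here/,$d'
}

# Quit at the END marker instead; needs -n plus an explicit p.
quit_form() {
    sed -n -e '3,/BEGIN.*cut here/d' -e '/END.*cut here/q' -e p
}

a=$(printf '%s\n' "$msg" | delete_form)
b=$(printf '%s\n' "$msg" | quit_form)
[ "$a" = "$b" ] && echo "same output"
printf '%s\n' "$b"
```

Both keep lines 1 and 2 plus whatever sits between the markers; the
second one just stops reading as soon as it sees the END line.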

> One minor aspect of all this is that while processing the archive
> messages I sometimes get a phantom "From foo(_at_)bar" showing up in the
> experimental DEFAULT mail box, that has no other lines.
> I thought it was my abuse of awk, somehow, but it has shown up after
> removing the awk part. Haven't seen yet where it is coming from.

Sounds like formail is passing in an "empty" message somehow. You
should be able to see where it comes from by looking in Procmail's log
file.

Hope this helps,

/* era */

-- 
.obBotBait: It shouldn't even matter whether     <http://www.iki.fi/era/>
I am a resident of the state of Washington. <http://members.xoom.com/procmail/>
 * Sign the European spam petition! <http://www.politik-digital.de/spam/en/> *