procmail

Re: Putting some text into the Subject

1997-06-15 12:22:00
On Sun, 15 Jun 97 13:17:25 -0400,
"Timothy J. Luoma" <luomat(_at_)peak(_dot_)org> wrote:
Here's an attempt which doesn't use external processes:

:0
* ^Subject: Read Receipt$
* ^X-Prefiled: mailer
* B ?? ^^( *$)*Your message regarding \/\
        .*$?.*[0-9][0-9]:[0-9][0-9]:[0-9][0-9] [ -+][0-9][0-9][0-9][0-9]
{
    # Desired string is now in $MATCH but might contain newlines
    ...
But this doesn't quite work.  The message:

Your message regarding "bug again" of Sat Jun 14 1997 12:27:25 -0400 was  
read on Sat Jun 14 1997 19:02:08 +0200.
<... Apparently, you want only the original subject from inside the quotes ...>

The fundamental problem, I guess, is that the first line of the
message is wrapped at some suitable point, which could be different
each time. We should probably canonicalize the whole thing back onto a
single line and then see what we can do to post-process it.
  You could tighten up the regular expression to always match exactly
the first two lines if you can be positive that the message will
always be long enough to be wrapped over two lines (exactly). 
  With regard to your sed script, you could always take advantage of
the maximal matching to make sure you delete everything that is not a
quotation mark after the last quotation mark. 

    :0B
    * ^^Your message regarding "\/[^"]*$[^"]*
    { }
    :0EB
    * ^^Your message regarding "\/[^"]*
    { }

    # MATCH should now contain the original Subject on one or two lines
    # See the other script I posted for gluing it back together
    ...

This will of course, again, fail if the Subject line contains
suitably placed quotation marks of its own. If you're positive that
the Subject will always span a single line, or always span two lines,
then you can tighten up the regular expression accordingly. But at
this point, reverting to sed is most likely the sane thing to do. You
can in fact make it always grab exactly the first two lines and trim
off the well-known tail, and end up with the original subject under all
circumstances. 

Let's say we always grab exactly two lines with sed, glue them
together, and trim off the quotation marks and time stamps:

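    # Date stamp regex:
    #    Mon Jan  1 1904 23:59:59 -2400
    # (the day may be space- or zero-padded, or a bare digit once runs
    # of spaces have been collapsed)
    #        M    o    n     J    a    n            1     1    9    0    4
    STAMP="[A-Z][a-z][a-z] [A-Z][a-z][a-z] [ 0-3]*[0-9] [1-2][0-9][0-9][0-9] \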
[0-2][0-9]:[0-5][0-9]:[0-5][0-9] [-+][0-9][0-9][0-9][0-9]"
# 2    3  :  5    9  :  5    9     -   2    4    0    0

    :0b
    SUBJ=| sed -e N -e'2{' -e 'y/\n/ /' -e 's/  */ /g' \
        -e 's/^Your message regarding "//' \
        -e 's/" of '"$STAMP"' was read on '"$STAMP"'[.]$//' -eq -e'}'
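
A quick way to check this kind of sed command outside procmail is to
feed it the sample body from a shell. What follows is only a sketch
under a few assumptions of mine: the function name extract_subject is
made up, STAMP is set as a plain shell variable rather than in the
rcfile, the day field is loosened so that a bare single digit also
matches, and the expression removes the closing quotation mark before
"of" and matches the final period literally.

```shell
#!/bin/sh
# Date stamp regex, e.g. "Sat Jun 14 1997 12:27:25 -0400".
# Assumption: the day field also accepts a bare single digit, because
# runs of spaces are collapsed before the stamp is matched.
STAMP='[A-Z][a-z][a-z] [A-Z][a-z][a-z] [ 0-3]*[0-9] [1-2][0-9][0-9][0-9] [0-2][0-9]:[0-5][0-9]:[0-5][0-9] [-+][0-9][0-9][0-9][0-9]'

# extract_subject (hypothetical name) reads a message body on stdin,
# joins the first two lines, and strips the fixed prefix and tail.
extract_subject() {
    sed -e N -e '2{' -e 'y/\n/ /' -e 's/  */ /g' \
        -e 's/^Your message regarding "//' \
        -e 's/" of '"$STAMP"' was read on '"$STAMP"'[.]$//' -e q -e '}'
}

# Sample body from this thread; the first line ends in two spaces.
RESULT=$(printf '%s\n' \
    'Your message regarding "bug again" of Sat Jun 14 1997 12:27:25 -0400 was  ' \
    'read on Sat Jun 14 1997 19:02:08 +0200.' | extract_subject)

printf '%s\n' "$RESULT"    # prints: bug again
```

If the tail does not match (say, an unexpected time stamp format), the
last substitution simply does nothing and the stamps stay in the
result, so it is worth testing against a few real receipts.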

Things get harder if the time stamp is allowed to contain a time zone
token like "EST" before the offset from UTC (e.g. EET +0200 or even
EET DST +0300). 
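
One possible way to cope with such tokens is to allow zero or more
uppercase words between the time and the offset. This is only a
sketch, and it assumes the tokens are plain uppercase letters
separated by single spaces; the names ZONE, STAMP_TZ and mark_stamp
are made up for the illustration, and the day field is loosened to
accept a bare single digit.

```shell
#!/bin/sh
# ZONE: zero or more space-separated uppercase tokens, e.g. " EET" or " EET DST".
# Assumption: zone tokens consist of uppercase letters only.
ZONE='\( [A-Z][A-Z]*\)*'

# Same shape as the date stamp regex, with ZONE inserted before the UTC offset.
STAMP_TZ='[A-Z][a-z][a-z] [A-Z][a-z][a-z] [ 0-3]*[0-9] [1-2][0-9][0-9][0-9] [0-2][0-9]:[0-5][0-9]:[0-5][0-9]'"$ZONE"' [-+][0-9][0-9][0-9][0-9]'

# mark_stamp (hypothetical name) replaces a line consisting of exactly
# one time stamp with the word STAMP, and leaves other lines alone.
mark_stamp() {
    sed -e 's/^'"$STAMP_TZ"'$/STAMP/'
}

A=$(echo 'Sat Jun 14 1997 19:02:08 +0200' | mark_stamp)
B=$(echo 'Sat Jun 14 1997 19:02:08 EET +0200' | mark_stamp)
C=$(echo 'Sat Jun 14 1997 19:02:08 EET DST +0300' | mark_stamp)
D=$(echo 'not a time stamp' | mark_stamp)

printf '%s\n' "$A" "$B" "$C" "$D"
```

The first three lines come out as STAMP and the last one is passed
through unchanged.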

Or, quite simply, we can remove everything up to the first quotation
mark and everything after the last (including the quotation marks, of
course):

    :0b
    SUBJ=| sed -e N -e'2{' -e 'y/\n/ /' -e 's/  */ /g' \
        -e 's/^[^"]*"//' -e 's/"[^"]*$//' -eq -e'}'
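
To see what the first-quote/last-quote approach does with a Subject
that contains quotation marks of its own, the sed part can be tried
from a shell. This is a sketch only: strip_outside_quotes is a made-up
name, and the second message is an invented example, not one from the
thread.

```shell
#!/bin/sh
# strip_outside_quotes (hypothetical name): join the first two lines,
# collapse runs of spaces, then drop everything up to the first
# quotation mark and everything from the last one onwards.
strip_outside_quotes() {
    sed -e N -e '2{' -e 'y/\n/ /' -e 's/  */ /g' \
        -e 's/^[^"]*"//' -e 's/"[^"]*$//' -e q -e '}'
}

# Sample body from this thread.
PLAIN=$(printf '%s\n' \
    'Your message regarding "bug again" of Sat Jun 14 1997 12:27:25 -0400 was  ' \
    'read on Sat Jun 14 1997 19:02:08 +0200.' | strip_outside_quotes)

# Invented body whose Subject has quotation marks of its own.
QUOTED=$(printf '%s\n' \
    'Your message regarding "the "bug" again" of Sat Jun 14 1997 12:27:25 -0400 was' \
    'read on Sat Jun 14 1997 19:02:08 +0200.' | strip_outside_quotes)

printf '%s\n' "$PLAIN"     # prints: bug again
printf '%s\n' "$QUOTED"    # prints: the "bug" again
```

The inner quotation marks survive because the second substitution can
only match at the last quotation mark on the joined line.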

The sample you quoted had some trailing whitespace after the "was" so
I included a snippet to collapse multiple spaces, too. (You might want
to augment it to collapse tabs also.) (And on my copy of sed, the \n
translation is not necessary, but I'm assuming this shouldn't be
relied on. Another host where I tested doesn't even understand \n, and
does require the newline to be translated before it will match. That's
sed for you. Drats.)
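
If the \n handling of the local sed is in doubt, one alternative is to
let head and tr do the line joining, which takes care of tabs at the
same time. This is a sketch of a different technique than the sed-only
commands above, at the cost of two more processes; join_two_lines is a
made-up name and the tab characters in the test input are invented.

```shell
#!/bin/sh
# join_two_lines (hypothetical name): head keeps the first two lines,
# tr turns tabs and newlines into spaces, so sed never sees a newline
# inside the pattern space and does not need to understand \n.
join_two_lines() {
    head -n 2 | tr '\t\n' '  ' |
        sed -e 's/  */ /g' -e 's/^[^"]*"//' -e 's/"[^"]*$//'
}

# The thread's sample with a tab inside the Subject and after "was".
OUT=$(printf 'Your message regarding "bug\tagain" of Sat Jun 14 1997 12:27:25 -0400 was \t\nread on Sat Jun 14 1997 19:02:08 +0200.\n' |
    join_two_lines)

printf '%s\n' "$OUT"    # prints: bug again
```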

I was under the impression that you wanted to grab either of the two
time stamps from the generated reply but of course, I suppose the
message you send off should have a good enough time stamp of its own.

/* era */

guessing David will one-up me as usual :-)

-- 
Defin-i-t-e-ly. Sep-a-r-a-te. Gram-m-a-r.  <http://www.iki.fi/~era/>
 * Enjoy receiving spam? Register at <http://www.iki.fi/~era/spam.html>