procmail

Re: Some strange is done to subject line

2009-10-29 14:22:40
At 11:48 2009-10-29 -0500, John Simpson wrote:
> On 10/29/09, Professional Software Engineering <PSE-L(_at_)mail(_dot_)professional(_dot_)org> wrote:
> > At 10:11 2009-10-29 -0500, Harry Putnam wrote:
> > > Subject: =?utf-8?B?UmV0cm9zcGVjdCBub3RpZmljYXRpb24gZnJvbSBCSlAgKDEwLzI3LzIwMDk
> >
> > Are you new to MIME encoding, or just unaware that it can be used to encode
> > subjects (and name text even) in the header?
>
> If " * ^Subject:.*Retrospect " is not correct, then what should the recipe be ?

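printf 'Retrospec' | mimencode

will give you the base-64 string for the start of the subject (UmV0cm9zcGVj). Use printf rather than echo, and encode a prefix whose length is a multiple of three characters: echo appends a newline, and base-64 encodes three bytes at a time, so encoding the whole word "Retrospect" plus that newline ends in dAo=, which does not match the posted header (it continues UmV0cm9zcGVjdCBu...). I can't say I'd want to do this from within procmail each time, so if you only have a handful of things to match, you might try:

# If you change the match string, to get the base-64 version, do something like:
# printf 'Retrospec' | mimencode
# (use a prefix whose length is a multiple of 3) and punch in the result
# here, prefixed by "=\?utf-8\?B\?" (the question marks must be escaped).
# This will match an original plaintext or base-64 encoded subject.  Since
# this is a notification from a program, you shouldn't expect Re: or Fwd:
# prefixes, so the whitespace preceding the subject keyword should be it.  If
# there WERE a reply prefix, this would be more complicated, because that
# would be part of the encoded subject (which offsets the BASE64 coding).
# The brackets below should contain one space and one tab.
  :0:
  * ^Message-Id:.*@reader\.local\.lan
  * ^Subject:[  ]*(Retrospect|=\?utf-8\?B\?UmV0cm9zcGVj)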
  retrospect.in

BTW note that we're also using the LOCKING flag, which was omitted on the originally posted recipe.
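
To see why a prefix or a trailing newline shifts the encoded text, here is a minimal shell sketch. The function name b64_prefix is made up for illustration, it assumes plain ASCII input, and it uses the coreutils base64 tool in place of mimencode (an assumption; either should produce the same encoding).

```shell
# Hypothetical helper (not from the original post): print the base-64
# encoding of the longest prefix of $1 whose length is a multiple of three.
# Base-64 maps each 3-byte group to 4 characters, so only a prefix aligned
# on a 3-byte boundary encodes the same way regardless of what follows it.
# Assumes ASCII text and the coreutils `base64` tool.
b64_prefix() {
    text=$1
    len=${#text}
    keep=$(( len - len % 3 ))          # round down to a multiple of 3
    printf '%s' "$text" | head -c "$keep" | base64
}

# Usage: the result is what goes after the escaped =\?utf-8\?B\? in a condition.
b64_prefix "Retrospect"
```

The output covers only "Retrospec"; the final "t" is dropped because its encoding depends on the bytes that follow it in the real subject.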

The alternative is to (ideally, in a central place in the procmailrc), identify and extract encoded subjects:

        # extract the subject and decode as appropriate.
        :0
        * ^Subject:[    ]*\/[^  ].*
        {
                SUBJECT=$MATCH
                ORIGSUBJ=$SUBJECT

                # is this a mime-encoded subject line?
                # match for a number of common character sets
                # expand as desired - this is NOT comprehensive
                :0
                * SUBJECT ?? ^^=\?\/(utf-8|Windows-1251|koi8-r)\?B\?
                * MATCH ?? \/[^\?]+
                {
                        SUBJENCODING=$MATCH

                        # now, decode the subject
                        :0
                        * $ SUBJECT ?? ^^=\?${SUBJENCODING}\?B\?\/.*
                        {
                                SUBJECT=`echo "$MATCH" | mimencode -u`
                        }
                }
        }

Then, anywhere you might normally refer to the subject:

* ^Subject: expression

You would instead write:

* SUBJECT ?? expression

Specifically, the original recipe becomes:


  :0:
  * ^Message-Id:.*@reader\.local\.lan
  * SUBJECT ?? ^Retrospect
  retrospect.in

This is very readable.
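
The decoding step of the extraction recipe can be tried outside procmail before trusting it with live mail. The sketch below is an approximation, not the recipe itself: decode_subject is a hypothetical name, it uses coreutils base64 -d instead of mimencode -u, it accepts any character set label rather than the three listed, and it strips the trailing "?=" before decoding. It handles a single encoded-word only and does no character set conversion.

```shell
# Hypothetical helper approximating the recipe's decode step.
# Input:  one subject string, either plain or =?charset?B?data?=
# Output: the decoded bytes, or the input unchanged if it is not encoded.
# Assumes the coreutils `base64` tool.
decode_subject() {
    case $1 in
        '=?'*'?'[Bb]'?'*'?=')
            data=${1#=\?*\?[Bb]\?}     # strip the leading =?charset?B?
            data=${data%\?=}           # strip the trailing ?=
            printf '%s' "$data" | base64 -d
            ;;
        *)
            printf '%s' "$1"           # plain subject, pass through
            ;;
    esac
}

# Usage
decode_subject '=?utf-8?B?UmV0cm9zcGVjdA==?='
```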

Where necessary, you can check SUBJENCODING to see what the character set encoding is. Because of multibyte character encoding for several non-western languages, expect a few errors during decode, due to nulls in the output string. I'd take the above recipes and stuff them into a sandbox, then throw a large corpus of saved emails at them.
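
When running a saved corpus through a sandbox, one thing worth flagging is decoded output that contains NUL bytes, since those are the decodes most likely to misbehave. A rough sketch, assuming you have already pulled the base-64 payload out of the subject; nul_count is a hypothetical name and coreutils base64 stands in for mimencode:

```shell
# Hypothetical helper: count NUL bytes in the decoded form of a base-64
# payload.  A non-zero count marks a subject to inspect by hand.
# Assumes the coreutils `base64` tool.
nul_count() {
    printf '%s' "$1" | base64 -d | tr -cd '\000' | wc -c | tr -d ' '
}

# Usage: the payload below decodes to the four bytes 'A' NUL 'B' NUL.
nul_count 'QQBCAA=='
```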

Note that in my extraction above, the subject has been stripped of leading whitespace (because for my own purposes, this is desirable). Modify the extraction or your individual references to it accordingly.

FTR, in my own experience, the encoded subject more often than not is employed in SPAM - not that your particular automated message is, but for an abundance of messages, it is.


---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)de
http://mailman.rwth-aachen.de/mailman/listinfo/procmail