Re: Decoding of and matching to RFC2047-encoded subjects?

Hannu Koivisto wrote:

What tools would you recommend for matching to RFC2047-encoded
subjects in procmail recipes? Ability to handle 8bit character sets
like ISO-8859-1 is enough (I'll pass the decoded subject to egrep;
if it ends up being gibberish, I don't care, the regexp just doesn't
match it which is ok by design of my recipes).


I recommend perl, mimencode, formail, and a simple recursive recipe.
(Recursion is needed because there can be more than one encoded string
in a header, and there can be more than one header with encoding.) Here
is how I do it.
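For readers who want to see the idea outside procmail, here is a minimal Python sketch of why one pass is not enough: it decodes one encoded word, then looks again, until none remain. It is not the recipe itself. The names ENCODED_WORD, decode_word and decode_header_text are made up for illustration, the order in which words are decoded may differ from the recipe's, and like the recipe it only handles iso-8859-1 and utf-8.

```python
import base64
import binascii
import re

# One RFC 2047 encoded word: =?charset?encoding?text?=
# Only the two charsets the recipe looks for are accepted here.
ENCODED_WORD = re.compile(
    r"=\?(iso-8859-1|utf-8)\?([bq])\?([^?]+)\?=", re.IGNORECASE
)

def decode_word(match):
    """Decode a single encoded word to text."""
    charset = match.group(1)
    encoding = match.group(2).lower()
    text = match.group(3).encode("ascii")
    if encoding == "b":
        raw = base64.b64decode(text)
    else:
        # header=True makes "_" decode to a space, as the Q encoding requires
        raw = binascii.a2b_qp(text, header=True)
    return raw.decode(charset, errors="replace")

def decode_header_text(value):
    """Decode one encoded word per pass until no more are found."""
    while True:
        value, count = ENCODED_WORD.subn(decode_word, value, count=1)
        if count == 0:
            return value
```

Calling decode_header_text on a subject holding two encoded words returns the text with both decoded, which is what the recursion in the recipe achieves one header rewrite at a time.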

One problem with such decoding is that the result may include control
characters that change the state of your display (for instance, putting
a VT emulator into graphics mode). To handle this, I replace every ASCII
control character except tab (octal 000-010 and 012-037) with an inverted
question mark. This processing can, of course, be removed if it doesn't
fit your situation.
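As a rough illustration of that replacement step, the Python sketch below applies the same byte ranges as the perl substitution in the recipe ([\00-\010\012-\037]). The names CONTROL_BYTES, REPLACEMENT and disarm are hypothetical, and the sketch works on raw bytes on the assumption that the decoded text is ISO-8859-1.

```python
import re

# Same ranges as the perl class [\00-\010\012-\037]:
# every C0 control byte except horizontal tab (octal 011).
CONTROL_BYTES = re.compile(rb"[\x00-\x08\x0a-\x1f]")

# 0xBF is the inverted question mark in ISO-8859-1.
REPLACEMENT = b"\xbf"

def disarm(raw: bytes) -> bytes:
    """Replace control bytes so they cannot affect the terminal."""
    return CONTROL_BYTES.sub(REPLACEMENT, raw)
```

An escape sequence such as ESC [ 0 m loses its ESC byte and so shows up as visible text, while tabs pass through untouched.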

In the calling rc file (typically .procmailrc):

  mimehdr_=[^o][^l].*:.*=\\?(iso-8859-1|utf-8)\\?[bq]\\?[^?]+\\?.*
  bq=B?bQ?q                           ## upper case to lower case
  pmrc=path/to/rc/files               ##
  :0                                  ##
  * $ ^\/$mimehdr_                    ##
  { INCLUDERC=$pmrc/demimehdr.rc }    ##
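
To make the trigger condition easier to read, here is the same pattern rendered as a Python regular expression. This is only a sketch: it assumes each doubled backslash in the assignment reaches the regex as a single one, it relies on Python's matching rules rather than procmail's, and MIMEHDR and needs_decoding are names invented for the example.

```python
import re

# Assumed form of $mimehdr_ once each "\\?" has been reduced to "\?".
# procmail conditions ignore case by default, hence IGNORECASE.
MIMEHDR = re.compile(
    r"^[^o][^l].*:.*=\?(iso-8859-1|utf-8)\?[bq]\?[^?]+\?.*",
    re.IGNORECASE | re.MULTILINE,
)

def needs_decoding(header_block: str) -> bool:
    """True if some header line other than an Old- one holds an encoded word."""
    return MIMEHDR.search(header_block) is not None
```

The leading [^o][^l] keeps already-processed Old- headers from matching again. In this rendering it also skips any other header whose name starts with "o" or has "l" as its second letter, such as Organization.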

In $pmrc/demimehdr.rc:

  hdrtxt=$MATCH                       ## save and proceed
  hdr txtb txte code                  ## reset temporary variables
  :0                                  ## extract header
  * MATCH ?? ^^\/[^:]+:~s             ##  name for regeneration
  { hdr=$MATCH }                      ##  and reporting
  :0                                  ## guard against non-coded
  * hdrtxt ?? ^^~S+:~c\/.*=\?[iu]     ##  '=?' in header
  * MATCH ?? ()\/.+[^iu]              ##  by dropping
  * MATCH ?? ()\/.+[^?]               ##  the trailing
  * MATCH ?? ()\/.+[^=]               ##  3 characters
  { txtb=$MATCH }                     ##  preceding the encoded part
  :0                                  ## grab trailing plain text
  * $ hdrtxt ?? $\txtb=\?.+\?.\?.+\?=\/.*       ##  which may
  { txte=$MATCH }                               ##  be empty
  :0                                            ## grab encoded text
  * $ hdrtxt ?? $\txtb=\?.+\?.\?\/[^?]+         ##  which is never
  { code=$MATCH }                     ##  empty
  :0 f h w                            ## grab encoding type and translate
  * hdrtxt ?? \?\/[bq]\?              ##  it to lower case and rewrite  
  * $ bq ?? $\MATCH\/.                ##  header with decoded text
  | formail -i "$hdr$txtb$(print -n $code | mimencode \   ##  disarm control
    -u -p -$MATCH | perl -pe 's/[\00-\010\012-\037]/~?/g')$txte" \  ## chars
            -A "X-Munged: ${hdr}converted from $MATCH"    ##  and report
  :0 a                                                    ## look for more
  * $ ^\/$mimehdr_                                        ##  encoded headers
  { INCLUDERC=$_ }                                        ##  recurse

Note that these files are not fed to procmail as shown. They first go
through a preprocessor which strips the comments and expands all
two-character tokens beginning with a tilde (~s, ~c, ~S, ~?). I find
that rc files are more readable this way. The expansions are:
  ~s expands to [  ] (space and tab character class)
  ~S expands to [^  ] (not ~s)
  ~c expands to [  ]*(\([^()]*\)[  ]*)*, which does a reasonable job
     of matching linear whitespace and comments, with the brackets
     enclosing space tab.
  ~? expands to the byte 0xBF, an inverted question mark in ISO-8859-1.
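
The preprocessor itself is not shown in this message. A minimal Python sketch of one that follows the description above might look like this. Everything in it is an assumption: the names EXPANSIONS and preprocess, the rule that a comment is anything from "##" to the end of the line, and the choice to drop the padding in front of the comment.

```python
import re

SPACE_TAB = "[ \t]"  # literal space and tab inside brackets

# Token table taken from the list above.
EXPANSIONS = {
    "~s": SPACE_TAB,
    "~S": "[^ \t]",
    "~c": SPACE_TAB + r"*(\([^()]*\)" + SPACE_TAB + "*)*",
    "~?": "\xbf",  # inverted question mark
}

def preprocess(line: str) -> str:
    """Strip a trailing '## comment', then expand the tilde tokens."""
    # Assumes "##" never occurs inside a regex or a command.
    line = re.sub(r"[ \t]*##.*$", "", line)
    return re.sub(r"~[sSc?]", lambda m: EXPANSIONS[m.group(0)], line)
```

Removing the padding along with the comment matters for the continued formail command, where the backslash has to end up as the last character on its line.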

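Note also that the case translation of the encoding type (the lookup in
$bq) really maps either case to lower case, since procmail conditions
ignore case unless the recipe carries the D flag. mimencode needs the
lower-case letter for its -b or -q option.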
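
The lookup trick can be mimicked in Python as follows. This is a sketch of the idea only, with BQ and lowercase_encoding as made-up names: the encoding letter plus the following "?" is searched for in the string B?bQ?q without regard to case, and the character right after the hit is the lower-case letter.

```python
import re

BQ = "B?bQ?q"  # same lookup string as bq in the calling rc file

def lowercase_encoding(marker: str) -> str:
    """Map 'B?', 'b?', 'Q?' or 'q?' to 'b' or 'q'."""
    # Mirrors the second condition of the filter recipe:
    # find marker in BQ ignoring case, keep the next character.
    found = re.search(re.escape(marker) + "(.)", BQ, re.IGNORECASE)
    if found is None:
        raise ValueError("unknown encoding marker: " + marker)
    return found.group(1)
```

Because the upper-case entry comes first in the string, a lower-case marker still hits it and yields the same lower-case result.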

This leaves a trail of Old- headers (formail -i renames the field it
replaces) and X-Munged headers (added by -A) which can be traced back.

I'm sure it can be improved, but it works for whatever I've had thrown
at it so far.

-- 
Rik Kabel          Old enough to be an adult              
rik(_at_)netcom(_dot_)com