procmail

Re: Decoding eight-bit characters in headers

1997-05-08 13:37:00
"J. Daniel Smith" <DanS(_at_)bristol(_dot_)com> writes:
Philip Guenther writes on 19 April 1997 at 18:55:26
Robin S Socha <uzs8kb(_at_)uni-bonn(_dot_)de> writes:
On Fri, 18 Apr 1997, Philip Guenther wrote:
I seem to be one of the lucky few that never get to enjoy the interesting
recipes ;-/

 >If you don't have libwww, just perl5, then the following _should_ work:
[...]
   :0 fh
   * =\?ISO-8859-[0-9]+\?Q\?
   |perl -pe 's#=\?ISO-8859-\d+\?Q\?(.*?)\?=#$s=$1; $s=~s/\s+(\r?\n)/$1/g;' \
        -e '$s=~s/=\r?\n//g; $s=~s/=([\da-fA-F]{2})/pack("C", hex($1))/ge;' \
        -e '$s#ge;'

Blech!  This is a downright *ugly* bit of code.  Seems to me like a
"formail" option is the best place to deal with this, although I can
see that adding smarts about charsets and the like could add a fair
amount of overhead.

The above decoding is actually incorrect (it fails several cases
involving surrounding whitespace), and in fact may cause mis-processing
by your MUA, as the encoding is allowed to 'hide' characters special to
the field (e.g., commas in a To: field).  To quote RFC 2047:

6.2. Display of 'encoded-word's

     Any 'encoded-word's so recognized are decoded, and if possible,
     the resulting unencoded text is displayed in the original
     character set.

     NOTE: Decoding and display of encoded-words occurs *after* a
     structured field body is parsed into tokens.  It is therefore
     possible to hide 'special' characters in encoded-words which, when
     displayed, will be indistinguishable from 'special' characters in
     the surrounding text. For this and other reasons, it is NOT
     generally possible to translate a message header containing
     'encoded-word's to an unencoded form which can be parsed by an RFC
     822 mail reader.


Because of these difficulties, I would *not* recommend doing the above
mis-decoding at all.  Fix or replace your MUA, but don't do the above
and then wonder why your MUA caught fire (or worse, misdirected a
reply) on that last message.
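
To make the RFC's point concrete, here is a minimal sketch in Python
(standard library only) of how a comma hidden inside an encoded-word
changes the result when a To: field is decoded before it is parsed.
The header value, the address, and the helper name decode_words are
made up for illustration; this is not a recommended filter, just a
demonstration of the ordering problem.

```python
from email.header import decode_header
from email.utils import getaddresses


def decode_words(value):
    """Decode any RFC 2047 encoded-words in value to plain text.

    Hypothetical helper for this illustration only.
    """
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts)


# Made-up To: field whose display name hides a comma as =2C.
raw = "=?ISO-8859-1?Q?Smith=2C_J=F6rg?= <jsmith@example.com>"

# Wrong order: decode the whole field first, then parse it.
# The now-visible comma is treated as an address separator.
wrong = getaddresses([decode_words(raw)])

# Order described in RFC 2047: parse into tokens, then decode for display.
right = [(decode_words(name), addr) for name, addr in getaddresses([raw])]

print(len(wrong), wrong)
print(len(right), right)
```

Decoding first yields two list entries where the sender wrote one
recipient, which is exactly the kind of misdirected reply to worry
about.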

Adding the above capabilities to formail is totally inappropriate, as
use of that option breaks conforming mail messages.

I can't find it now, but I remember Eric Allman turning down a request
that sendmail do this for similar reasons.  He also mentioned that
sendmail can't do it the other way (8bit -> encoded 7bit) as it doesn't
know what charset to use (and no, sendmail really can't work around
that: consider the case of a Russian person in Israel.  His headers
may be in KOI-8 while the body is in ISO-8859-8.)
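
The charset problem can be sketched the same way.  The three bytes
below are an arbitrary sample, and "koi8-r" is assumed here as the
concrete KOI-8 variant; the point is only that identical 8-bit bytes
are valid text in both charsets, so an encoder that is not told which
one applies cannot produce a correct encoded-word.

```python
from email.header import Header

# Arbitrary sample of raw 8-bit header bytes (an assumption for illustration).
raw = b"\xf0\xf2\xe9"

# The same bytes decode to different text depending on the charset guess.
as_koi8 = raw.decode("koi8-r")
as_hebrew = raw.decode("iso-8859-8")

# An encoded-word must name its charset, so each guess gives a different header.
enc_koi8 = Header(as_koi8, "koi8-r").encode()
enc_hebrew = Header(as_hebrew, "iso-8859-8").encode()

print(as_koi8, "->", enc_koi8)
print(as_hebrew, "->", enc_hebrew)
```

Both outputs are well-formed, and nothing in the bytes themselves says
which one the sender meant.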

I therefore officially retract the "_should_" in my original text,
and instead insert "_doesn't_".

Philip Guenther
