procmail

Re: Fixing broken ISO-8859-1 e-mail

1999-07-01 06:50:39
On Thu, 1 Jul 1999 14:50:32 +0200 (MET DST), Ralph SOBEK
<sobek(_at_)irit(_dot_)fr> wrote:
> I often get e-mail that states that it is iso-8859-1 and that
> the Content-Transfer-Encoding is 8bit.  Still, certain special
> characters show up as unknown 8-bit codes.  It seems to happen only if
> the originating machine was a PC.  Is there a filter available that

The general answer is that if it claims to be Latin-1 but isn't,
you're in the dark as to what it could be. A Windows machine set up
with the usual Western European preferences can reasonably be expected
to use codepage 1252, but how do you know whether the user had actually
selected a Western codepage?
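
One weak hint is available. ISO-8859-1 reserves 0x80 through 0x9F for
control codes, so a message labelled iso-8859-1 that contains such
bytes is almost certainly mislabelled. A minimal sketch of that check
in Python 3 (the function name is made up for illustration):

```python
def has_c1_bytes(body: bytes) -> bool:
    """Return True if any byte lies in 0x80-0x9F.

    ISO-8859-1 assigns that range to control codes, so printable
    Latin-1 text should never contain it.
    """
    return any(0x80 <= byte <= 0x9F for byte in body)
```

A hit only says that the label is wrong. It cannot say which codepage
the sender actually used.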

For this reason, it's not very likely that you will find any
thoroughly tested and generally accepted solution. But if you are
willing to take my guesses, here goes:

> can correct this?  For an example, an e-mail with the following
> headers:
>
> X-Mailer: Microsoft Outlook Express 4.72.3110.5
> X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3110.3
>
> produces iso-8859-1 text with properly accented letters but I find the
> following codes (the code is in the second column - and is unreadable
> for me):
>
>      \226    ?
>      \205    ?

A listing of Windows CP1252 (and then some) is available from
<http://czyborra.com/charsets/codepages.html#CP1252> but it doesn't
have the character codes in octal notation. The characters above are
0x96 and 0x85 in hex; they appear to correspond to en dash and
ellipsis, respectively. Probably you have also stumbled over single
(0x91 and 0x92) and double (0x93 and 0x94) quotation marks, and perhaps
the French oe ligature (0x9C, or 0x8C in uppercase).
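
Since that listing lacks octal codes, here is a rough sketch that maps
an octal code to its CP1252 character name. It assumes Python 3 and
its built-in cp1252 codec; the helper name describe_octal is invented
for this example.

```python
import unicodedata


def describe_octal(code: str) -> str:
    """Map an octal byte code such as '226' to its CP1252 character name."""
    value = int(code, 8)  # e.g. '226' octal is 0x96
    try:
        char = bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few positions (e.g. 0x81) unassigned.
        return f"\\{code} = 0x{value:02X}: undefined in CP1252"
    return f"\\{code} = 0x{value:02X}: {unicodedata.name(char)}"


if __name__ == "__main__":
    # The two codes from the question, then quotes and the oe ligature.
    for code in ("226", "205", "221", "222", "223", "224", "234", "214"):
        print(describe_octal(code))
```

For the two codes in the question this reports EN DASH and HORIZONTAL
ELLIPSIS, which agrees with the reading above.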

Here's a simple recipe to replace those. (Since the quoted posting
already contains unencoded 8-bit characters, I'm taking the liberty of
putting in some "funny" codes here. Contact your manufacturer if your
machine crashes :-)

    :0bfw
    * ^Content-Type:\<*text/plain;\<*charset=iso-8859-1
    * ^X-Mailer:\<*microsoft\<+outlook\<+express\<+4\.72\.
    * ^X-MimeOLE:|#... make this as tight as you want, this is enough for me
    * [?-?]
    | perl -pe 's/\x91/`/g; s/\x92/'"'"'/g; s/\x93/``/g; s/\x94/'"''"'/g;' \
        -e 's/\x9c/oe/g; s/\x8c/OE/g; s/\x96/--/g; s/\x85/.../g;'

I've used Perl in the action because it's guaranteed to be able to
cope with 8-bit data, whereas the same might not necessarily hold for
your variant of sed or tr. If your tr can handle it, it's the most
efficient for one-to-one replacements; to replace a string with
another string (where at least one of them is longer than one
character) you need sed or something (where "something" should be the
most lightweight 8-bit capable version of sed, awk, nawk, gawk, or
perl you have available, roughly in that order of preference).
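
If none of those tools is at hand, the same substitutions can be
sketched in Python 3. This is only an illustration, not part of the
recipe: the names REPLACEMENTS, approximate_cp1252 and filter_stream
are made up, and the table simply mirrors the Perl one-liner above.

```python
import sys

# CP1252 bytes and the ASCII approximations used in the Perl one-liner.
REPLACEMENTS = {
    b"\x91": b"`",    # left single quote
    b"\x92": b"'",    # right single quote
    b"\x93": b"``",   # left double quote
    b"\x94": b"''",   # right double quote
    b"\x9c": b"oe",   # small oe ligature
    b"\x8c": b"OE",   # capital OE ligature
    b"\x96": b"--",   # en dash
    b"\x85": b"...",  # ellipsis
}


def approximate_cp1252(body: bytes) -> bytes:
    """Replace the listed CP1252 bytes; all other bytes pass through untouched."""
    for bad, good in REPLACEMENTS.items():
        body = body.replace(bad, good)
    return body


def filter_stream(src, dst) -> None:
    """Read a message body from a binary stream and write the fixed copy."""
    dst.write(approximate_cp1252(src.read()))
```

Working on bytes rather than decoded text sidesteps the question of
what the charset really is. A script ending in
filter_stream(sys.stdin.buffer, sys.stdout.buffer) could then take the
place of the perl command in the action line.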

You might also want to notify the user that your local filters have
tampered with the message, by adding either a footer (which could be
done as part of the above recipe) or a header:

    :0afhw
    | formail -a 'X-Notice: faux iso-8859-1 replaced with approximations'

Automatically mailbombing the offender is left as an exercise.

Hope this helps,

/* era */

-- 
.obBotBait: It shouldn't even matter whether     <http://www.iki.fi/era/>
I am a resident of the state of Washington. <http://members.xoom.com/procmail/>
 * Sign the European spam petition! <http://www.politik-digital.de/spam/en/> *
