Re: MS Outlook mail creating characters not appearing (possible solution)

2006-11-09 13:11:55
I have a couple of emails that were generated using MS
Outlook which contain some html entities like smart quotes
and the funny "-" character which just appear as "?"
characters in the archive.
MS Outlook has a nasty habit of mislabeling the charset of its
messages as iso-8859-1 instead of MS's extension to it that
contains the characters being used.

=v= Some older versions send out mail that doesn't specify a
charset, so many apps assume the text is US-ASCII (which is the
default the MIME standard prescribes), though of course it's
really Windows-1252.
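
=v= To make the mislabeling concrete, here is a minimal sketch in
Python (my choice of language, not anything Outlook or the archive
software uses).  The function name decode_body and the set of
"suspect" labels are my own assumptions.  It treats a missing,
ASCII, or iso-8859-1 label as Windows-1252, which is harmless for
genuine ASCII or Latin-1 text because the printable ranges agree:

```python
from typing import Optional

# Labels that are suspect: the bytes may really be Windows-1252.
# This set is an assumption for illustration, not an official list.
SUSPECT_LABELS = {"us-ascii", "ascii", "iso-8859-1", "latin-1", "latin1"}


def decode_body(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode a message body, treating suspect labels as Windows-1252."""
    # No declared charset means the MIME default, US-ASCII.
    label = (declared or "us-ascii").strip().lower()
    if label in SUSPECT_LABELS:
        # Bytes 0x80-0x9F hold the smart quotes and dashes in cp1252;
        # the few bytes cp1252 leaves undefined become U+FFFD.
        return raw.decode("cp1252", errors="replace")
    try:
        return raw.decode(label, errors="replace")
    except LookupError:
        # Unknown charset name: fall back to the same guess.
        return raw.decode("cp1252", errors="replace")
```

With this, the byte 0x92 in a message labeled iso-8859-1 comes out
as a curly apostrophe instead of a "?".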

=v= Those particular characters in Windows-1252 occupy the
0x80-0x9F range, which iso-8859-1 reserves for control codes,
so they violate charset standards anyway.  Even worse, MS
products such as Outlook and Word insert these
standard-violating "smart quotes" in the wrong places.
Sometimes they're backwards (i.e. a quote will start with a
"curly close quote" and end with a "curly open quote"), and
usually an apostrophe is turned into a "curly single close
quote," which is just wrong.

=v= Someone wrote a routine that looks for these encodings and
turns them into ASCII equivalents.  You lose some fanciness, but
what good is fanciness when it's just wrong?  This has a much
higher probability of turning out correctly than translating
them into iso-8859-1 or UTF-8 (or even HTML entities).  The
code is called "demoroniser" and is available in Perl:

It has been widely ported.  For example, it's in CPAN's
TextToHTML Perl module and is part of Macromedia's ColdFusion
web product.
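
=v= For anyone who just wants the idea without the Perl, here is a
rough sketch in Python of the same kind of substitution.  It is
not the demoroniser's own table or logic; the name to_ascii and
the particular replacements are my assumptions, and it expects
text that has already been decoded (as Windows-1252) into Unicode:

```python
# Map the characters Windows-1252 keeps in 0x80-0x9F (shown here by
# their Unicode code points) to plain ASCII stand-ins.  The choice of
# stand-ins is illustrative only.
ASCII_EQUIVALENTS = {
    0x2018: "'",     # curly single open quote
    0x2019: "'",     # curly single close quote / apostrophe
    0x201A: ",",     # low single quote
    0x201C: '"',     # curly double open quote
    0x201D: '"',     # curly double close quote
    0x201E: ",,",    # low double quote
    0x2013: "-",     # en dash
    0x2014: "--",    # em dash
    0x2026: "...",   # ellipsis
    0x2022: "*",     # bullet
    0x2122: "(TM)",  # trademark sign
}


def to_ascii(text: str) -> str:
    """Replace 'smart' punctuation with ASCII equivalents."""
    # str.translate accepts a dict of code point -> replacement string.
    return text.translate(ASCII_EQUIVALENTS)
```

Because both directions of curly quote map to the same straight
quote, it doesn't matter whether the sender's software got them
backwards.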