I have a couple of emails that were generated using MS
Outlook which contain some html entities like smart quotes
and the funny "-" character which just appear as "?"
characters in the archive.
MS Outlook has a nasty habit of mislabeling the charset of its
messages as iso-8859-1 instead of Windows-1252, MS's extension to
it that contains the characters being used.
=v= Some older versions send out mail that doesn't specify a
charset at all, so many apps assume the text is ASCII (which is
the default the MIME standard prescribes), though of course it's
really Windows-1252.
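
A minimal sketch of how an archiver might work around the mislabeling, assuming it is written in Python and has the raw body bytes plus whatever charset the message declared. The function name decode_outlook_body and the list of suspect labels are made up for illustration and are not taken from any particular archiver:

```python
from typing import Optional

# Labels that Outlook-generated mail tends to carry (or imply) when the
# bytes are really Windows-1252. This list is an assumption, not a standard.
SUSPECT_LABELS = {"iso-8859-1", "latin-1", "latin1", "us-ascii", "ascii"}


def decode_outlook_body(raw: bytes, declared_charset: Optional[str] = None) -> str:
    """Decode a message body, treating suspect labels as Windows-1252."""
    # A missing charset defaults to US-ASCII under MIME.
    charset = (declared_charset or "us-ascii").lower()
    if charset in SUSPECT_LABELS:
        # Windows-1252 agrees with ASCII below 0x80 and with iso-8859-1
        # from 0xA0 up, so only the 0x80-0x9F bytes decode differently.
        charset = "cp1252"
    # A few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined in cp1252;
    # errors="replace" substitutes U+FFFD for them instead of raising.
    return raw.decode(charset, errors="replace")
```

Mail that really is ASCII or iso-8859-1 and stays out of the 0x80-0x9F range decodes exactly as before, so the override costs little.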
=v= Those particular characters sit in the 0x80-0x9F range, which
iso-8859-1 reserves for control codes, so they violate charset
standards anyway. Even worse, MS products such as Outlook and
Word insert these standard-violating "smart quotes" in the wrong
places. Sometimes they're backwards (i.e. a quote will start
with a "curly close quote" and end with a "curly open quote"),
and usually an apostrophe is turned into a "curly single close
quote," which is just wrong.
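
Since the curly versions are so often wrong, flattening them to plain ASCII loses little. Here is a rough sketch of that idea in Python, working on the raw single-byte text. The table is illustrative and incomplete, and the name flatten_smart_punctuation is hypothetical; none of it is taken from an existing tool:

```python
# Windows-1252 bytes in the 0x80-0x9F range, with plain-ASCII stand-ins.
# Illustrative only: the remaining bytes in that range pass through as-is.
CP1252_TO_ASCII = {
    0x85: b"...",  # horizontal ellipsis
    0x91: b"'",    # curly single open quote
    0x92: b"'",    # curly single close quote (used for apostrophes)
    0x93: b'"',    # curly double open quote
    0x94: b'"',    # curly double close quote
    0x96: b"-",    # en dash
    0x97: b"--",   # em dash
}


def flatten_smart_punctuation(raw: bytes) -> bytes:
    """Replace Windows-1252 smart punctuation bytes with ASCII equivalents."""
    out = bytearray()
    for byte in raw:
        # Bytes missing from the table are copied through unchanged.
        out += CP1252_TO_ASCII.get(byte, bytes([byte]))
    return bytes(out)
```

This is only safe on text known to be single-byte; run over UTF-8 it would mangle multi-byte sequences, because their continuation bytes can fall in the same range.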
=v= Someone wrote a routine that looks for these encodings and
turns them into ASCII equivalents. You lose some fanciness, but
what good is fanciness when it's just wrong? This has a much
higher probability of turning out correctly than translating
them into iso-8859-1 or UTF-8 (or even HTML entities). The
code is called "demoroniser" and is available in Perl.
It has been widely ported. For example, it's in CPAN's
TextToHTML Perl module and is part of Macromedia's ColdFusion