From procmail-bounces(_at_)lists(_dot_)RWTH-Aachen(_dot_)de Mon May 21 20:34:42 2012
Subject: Re: detect an email with japanese characters
From: LuKreme <kremels(_at_)kreme(_dot_)com>
Date: Mon, 21 May 2012 19:31:42 -0600
To: "procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)de"
<procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)de>
On May 21, 2012, at 17:25, Robert Bonomi
<bonomi(_at_)mail(_dot_)r-bonomi(_dot_)com> wrote:
Note: you cannot 'safely' drop 'anything' with such a glyph in it
since Microsoft products routinely use several 3-byte glyphs --
things like 'smartquotes', dashes, etc. (*snarl*)
Oh, it's not just MSFT, there are many high byte characters in UTF-8
that are perfectly usable and proper. The days of 7-bit email are long
behind us, and that's a good thing.
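To illustrate the point about 3-byte glyphs, here is a small Python sketch. The sample characters and the helper name utf8_lead_byte are chosen for illustration only and do not come from the thread. It shows that 'smartquotes' and dashes encode to three bytes in UTF-8 just as Japanese characters do, so a rule keyed on "any 3-byte sequence" would presumably match both.

```python
# Hypothetical sample characters, picked for illustration only.
samples = {
    "right single quote": "\u2019",
    "em dash": "\u2014",
    "hiragana a": "\u3042",
    "kanji 'day'": "\u65e5",
}

def utf8_lead_byte(ch: str) -> int:
    """Return the first byte of the UTF-8 encoding of a single character."""
    return ch.encode("utf-8")[0]

for name, ch in samples.items():
    encoded = ch.encode("utf-8")
    # Every sample here is three bytes long; only the lead byte differs.
    print(f"{name}: {len(encoded)} bytes, lead byte 0x{utf8_lead_byte(ch):02x}")
```

The punctuation samples start with 0xe2, while the Japanese samples start with other lead bytes, so a filter would likely need to look at which lead byte (or which code point range) it sees, not merely at the sequence length.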
In 'western' usage, it is exceedingly rare to -need- anything beyond the
so-called C0 through C3 glyph sets (roughly 256 'printable' symbols).
Microsoft is well known for its egregious MISUSE of UTF-8 multi-byte
glyphs. *Especially* in documents that are identified as using something
_other_ than UTF-8. One simply cannot 'trust' MS products to get the
'content-type' right. Their products are notorious for, say, _declaring_
a document as 'iso-8859-1' or 'Windows-1251', but including in that
document a handful of UTF-8 3-byte sequences from the '0xe2', '0xe7',
and '0xef' ranges.
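A rough Python sketch of how one might spot such stray sequences in a body that is declared as a single-byte charset. The function name find_stray_utf8 and the sample body are assumptions made for this example, and the pattern is deliberately simplified: it only looks for 3-byte sequences and relies on the decoder to reject overlong or otherwise invalid forms.

```python
import re

# A UTF-8 3-byte sequence: lead byte 0xE0-0xEF plus two continuation
# bytes in the range 0x80-0xBF. Simplified on purpose.
THREE_BYTE = re.compile(rb"[\xe0-\xef][\x80-\xbf]{2}")

def find_stray_utf8(body: bytes) -> list:
    """Return (offset, character) pairs for valid 3-byte UTF-8 sequences."""
    hits = []
    for match in THREE_BYTE.finditer(body):
        try:
            hits.append((match.start(), match.group().decode("utf-8")))
        except UnicodeDecodeError:
            # Looked like UTF-8 but is not valid (e.g. an overlong form).
            pass
    return hits

# Hypothetical body: declared iso-8859-1, yet the apostrophe is a UTF-8
# right single quote (e2 80 99) while the final e-acute is plain Latin-1.
body = b"It\xe2\x80\x99s caf\xe9"
print(find_stray_utf8(body))
```

Note that a genuine single-byte text can contain the same byte pattern by coincidence, so a hit is a hint rather than proof.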
For processing arbitrary e-mail from a Microsoft product, one has to
essentially throw away the declared charset, parse out the 'valid'
ASCII/ISO-8859/WINDOWS-125x/UTF-8 glyphs that one can recognize, and
do 'something sensible' with 'whatever is left unrecognized'.
It takes a couple of hundred lines of 'C' code to convert a putative
ASCII/ISO-8859/WINDOWS-125x/UTF-8 document to a consistent format,
say ISO-8859-1. I know, I've written it.
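The C code mentioned above is not shown in the thread. What follows is only a minimal Python sketch of the general idea, under stated assumptions: ignore the declared charset, accept any valid UTF-8 multi-byte sequence, and fall back to Windows-1252 (then ISO-8859-1) one byte at a time. The function name to_text and the fallback order are choices made for this example, not the poster's method.

```python
def to_text(raw: bytes) -> str:
    """Decode a byte string that may mix ASCII, UTF-8 and single-byte text."""
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:                      # plain ASCII
            out.append(chr(b))
            i += 1
            continue
        # Expected sequence length if this byte is a UTF-8 lead byte.
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            length = 0                    # cannot start a UTF-8 sequence
        if length:
            try:
                out.append(raw[i:i + length].decode("utf-8"))
                i += length
                continue
            except UnicodeDecodeError:
                pass                      # not valid UTF-8, try single-byte
        try:
            out.append(raw[i:i + 1].decode("cp1252"))
        except UnicodeDecodeError:
            # Byte undefined in Windows-1252: keep it as ISO-8859-1.
            out.append(raw[i:i + 1].decode("latin-1"))
        i += 1
    return "".join(out)

# Hypothetical input mixing UTF-8, Latin-1 and Windows-1252 bytes.
mixed = b"It\xe2\x80\x99s caf\xe9 \x93ok\x94"
print(to_text(mixed))
```

The result is a Unicode string; something like to_text(mixed).encode("iso-8859-1", errors="replace") would then give a consistent ISO-8859-1 form, with characters that do not exist in that charset (the curly quotes, for instance) replaced. The heuristic can guess wrong: a Latin-1 text that really contains the two characters 0xc3 0xa9 will be read as a single UTF-8 e-acute.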
____________________________________________________________
procmail mailing list Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)de
http://mailman.rwth-aachen.de/mailman/listinfo/procmail