Re: detect an email with japanese characters

From procmail-bounces(_at_)lists(_dot_)RWTH-Aachen(_dot_)de  Mon May 21 20:34:42 2012
Subject: Re: detect an email with japanese characters
From: LuKreme <kremels(_at_)kreme(_dot_)com>
Date: Mon, 21 May 2012 19:31:42 -0600
To: "procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)de" 
<procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)de>

On May 21, 2012, at 17:25, Robert Bonomi 
<bonomi(_at_)mail(_dot_)r-bonomi(_dot_)com> wrote:

Note: you cannot 'safely' drop 'anything' with such a glyph in it
    since Microsoft products routinely use several 3-byte glyphs --
    things like 'smartquotes', dashes, etc.   (*snarl*)


Oh, it's not just MSFT, there are many high byte characters in UTF-8 that
are perfectly usable and proper. The days of 7-bit email are long behind
us, and that's a good thing.
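
To make the point above concrete, here is a rough Python sketch (not taken
from anyone's actual filter) showing that typographic punctuation and
Japanese kana both encode to 3 bytes in UTF-8, so a test for "contains a
3-byte sequence" cannot stand in for "is Japanese". The helper names
utf8_hex and has_kana are made up for illustration, and the kana check is
only one possible heuristic: it ignores kanji, which are shared with
Chinese.

```python
def utf8_hex(ch):
    """Return the UTF-8 encoding of ch as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


def has_kana(text):
    """True if text contains hiragana (U+3040-U+309F) or katakana (U+30A0-U+30FF)."""
    return any(0x3040 <= ord(c) <= 0x30FF for c in text)


# Hypothetical sample characters, written as escapes to stay ASCII-safe.
samples = {
    "left double quote": "\u201c",
    "em dash": "\u2014",
    "hiragana a": "\u3042",
    "katakana a": "\u30a2",
}

for name, ch in samples.items():
    # Every sample is 3 bytes long; only the code point range differs.
    print(f"{name:18} U+{ord(ch):04X}  {utf8_hex(ch)}")
```

The smart quote and the dash start with 0xe2, the kana with 0xe3, so the
decision has to be made on decoded code points rather than on byte length.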


In 'western' usage, it is exceedingly rare to -need- anything beyond the 
so-called C0 through C3 glyph sets (roughly 256 'printable' symbols).

Microsoft is well known for its egregious MISUSE of UTF-8 multi-byte 
glyphs.  *Especially* in documents that are identified as using something 
_other_ than UTF-8.  One simply cannot 'trust' MS products to get the 
'content-type' right.  Their products are notorious for, say, _declaring_
a document as 'iso-8859-1' or 'Windows-1251', but including in that 
document a handful of UTF-8 3-byte sequences from the '0xe2', '0xe7', 
and '0xef' ranges.  
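
As an illustration of the mislabelling described above, the following
Python sketch scans raw bytes for well-formed UTF-8 3-byte sequences,
whatever charset the message declares. The function name and the sample
message are hypothetical, and the pattern is only a heuristic: the same
byte triples are also legal (if unlikely) single-byte text, so a hit is a
hint, not proof.

```python
import re

# Well-formed UTF-8 3-byte sequences (excluding overlongs and surrogates).
UTF8_3BYTE = re.compile(
    rb"\xe0[\xa0-\xbf][\x80-\xbf]"
    rb"|[\xe1-\xec\xee\xef][\x80-\xbf]{2}"
    rb"|\xed[\x80-\x9f][\x80-\xbf]"
)


def embedded_utf8(raw):
    """Return the characters encoded by every UTF-8 3-byte sequence found in raw."""
    return [m.group().decode("utf-8") for m in UTF8_3BYTE.finditer(raw)]


# A body declared as iso-8859-1 (0xe9 is e-acute there) that nevertheless
# carries UTF-8 encoded smart quotes.
body = b"caf\xe9 " + "\u201chi\u201d".encode("utf-8")
print(embedded_utf8(body))
```

The lone 0xe9 is not reported because it is not followed by two
continuation bytes; the two smart quotes are.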

For processing arbitrary e-mail from a Microsoft product, one has to
essentially throw away the declared charset, parse out the 'valid'
ASCII/ISO-8859/WINDOWS-125x/UTF-8 glyphs that one can recognize, and
do 'something sensible' with 'whatever is left unrecognized'.

It takes a couple of hundred lines of 'C' code to convert a putative
ASCII/ISO-8859/WINDOWS-125x/UTF-8 document to a consistent format,
say ISO-8859-1.  I know, I've written it.
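
For readers who want to see the general shape of such a converter, here is
a much-reduced Python sketch. It is not the C code mentioned above, which
was not posted. It assumes one particular order of guesses (ASCII, then a
valid UTF-8 sequence, then Windows-1252, then ISO-8859-1), and the names
decode_mixed, to_latin1 and the ASCII_FALLBACK table are invented for the
example.

```python
def decode_mixed(raw):
    """Decode bytes that may mix ASCII, UTF-8 and Windows-1252/ISO-8859-1."""
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:                      # plain ASCII
            out.append(chr(b))
            i += 1
            continue
        # Sequence length implied by a UTF-8 lead byte, 0 if b is not one.
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            length = 0
        if length:
            try:
                out.append(raw[i:i + length].decode("utf-8"))
                i += length
                continue
            except UnicodeDecodeError:
                pass                      # not valid UTF-8 here, fall through
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            out.append(chr(b))            # ISO-8859-1: byte value == code point
        i += 1
    return "".join(out)


# 'Something sensible' for glyphs ISO-8859-1 lacks: an assumed ASCII mapping.
ASCII_FALLBACK = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "--",
})


def to_latin1(text):
    """Encode text as ISO-8859-1, substituting '?' for anything left over."""
    return text.translate(ASCII_FALLBACK).encode("iso-8859-1", errors="replace")
```

For example, to_latin1(decode_mixed(raw)) turns a body containing both
Windows-1252 and UTF-8 smart quotes into plain ISO-8859-1 with ASCII
quotes. One known weakness of this ordering: genuine single-byte text that
happens to form a valid UTF-8 sequence is read as UTF-8.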

____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)de
http://mailman.rwth-aachen.de/mailman/listinfo/procmail