procmail

Re: detect an email with japanese characters

2012-05-21 18:23:45

 Konstantin <klk206(_at_)panix(_dot_)com> wrote:

Hi,

How is it possible to detect (and filter) an email written in Japanese
characters (which I cannot read anyway)?

The Content-Type specifies charset="utf-8". The "From" field is apparently
invalid, and may not necessarily contain .jp

What I do is:

  a) specify a list of charsets that I understand:

     OK_CHARSET=(ASCII|DISPAY|ISO-8859-[12]|WINDOWS-125[012]|utf-8|utf8)

  b) filter anything that (1) specifies a charset, and (2) does -not- have
     one of those charsets:

     :0 H
     * ^(From|To|Subject): *=\?\/.*
     * ! $ MATCH ?? ${OK_CHARSET}
     $DISCARD

     :0 H
     * ^Content-Type:.*charset\/.*
     * ! $ MATCH ?? ${OK_CHARSET}
     $DISCARD

  c) for 'foreign language' issues, I look for 'commonly occurring'
     character-sequences (presumably 'words', but, since I don't speak
     the language, I'm not -sure- of that :) and look for any of several
     such (presumed) 'words' in the message body.  e.g.:
       for German (which I understand, a little):
        :0 H
        * ^Subject:.*\<(aufmachen|und|der|Ihr|Ihre|Veil|Zeit)\>
        $DISCARD

       for Italian (which I don't):
        :0 EH
        * ^Subject:.*\<(aviso|limitazione|posteitaliane|Urgente|attenzione|logiciel|prospection|de la)\>
        $DISCARD

     Japanese (along with most other languages that do not use a 'latin'-
     based character set), in utf-8, is going to have multiple multi-byte
     'glyphs' in a single word.  Detecting 'glyph' boundaries is a little
     complex, but eminently 'doable'. All utf-8 multi-byte glyphs start
     with a high-bit-set byte. If the first byte is '0xc2'-'0xdf', it is
     a 2-byte glyph. If it is '0xe0'-'0xef' it is a 3-byte glyph, and
     '0xf0'-'0xf4' a 4-byte glyph ('0xf5'-'0xff' never appear in valid
     utf-8).  All the 'follower' bytes in the glyph are in the
     '0x80'-'0xbf' range.  The '0xc2' and '0xc3' sets include the most
     common 'non ASCII' characters in most 'western' languages.

     *MOST* glyphs whose first byte is above '0xc8' are either
     'specialized use' ones, or 'non western' language symbols.

     Note: you cannot 'safely' drop 'anything' with such a glyph in it
     since Microsoft products routinely use several 3-byte glyphs --
     things like 'smartquotes', dashes, etc.   (*snarl*)
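
To make the byte ranges above concrete, here is a minimal sketch in Python
(not procmail) that looks at the first byte of each sequence and tallies how
many 1-, 2-, 3- and 4-byte glyphs a chunk of text contains. The function
names are made up for illustration, and it assumes the input is raw utf-8
that has already been decoded from any base64 or quoted-printable transfer
encoding.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the expected length of a utf-8 sequence from its first byte.

    Returns 0 for bytes that cannot start a sequence: follower bytes
    (0x80-0xbf) and the values 0xc0, 0xc1 and 0xf5-0xff.
    """
    if first_byte <= 0x7F:
        return 1  # plain ASCII
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4
    return 0


def sequence_length_counts(data: bytes) -> dict:
    """Count the sequences of each length found in a byte string."""
    counts = {1: 0, 2: 0, 3: 0, 4: 0}
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        if n == 0:
            i += 1  # stray follower or invalid byte: skip it
            continue
        counts[n] += 1
        i += n
    return counts


# A 'smartquote' and a dash are 3-byte glyphs, just like the kanji.
sample = "It\u2019s ok \u2013 \u65e5\u672c\u8a9e".encode("utf-8")
print(sequence_length_counts(sample))  # {1: 8, 2: 0, 3: 5, 4: 0}
```

The sample shows why the count of 3-byte glyphs alone is a poor test: the
English part of it contributes two of the five.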

Your best bet is to look for 'commonly occurring' glyph sequences in the 
Japanese utf-8 text, and filter on those sequences.
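
As a rough illustration of that idea, here is a Python sketch that swaps in
a simpler test than a hand-picked list of sequences: it counts kana
(hiragana and katakana, U+3040-U+30FF), which show up in nearly all
Japanese text and encode in utf-8 as 0xe3, then 0x81-0x83, then one
follower byte. The function names and the threshold of 5 are assumptions
for illustration, not tuned values, and the input is assumed to be utf-8
bytes already decoded from base64 or quoted-printable.

```python
import re

# Kana occupy U+3040-U+30FF. In utf-8 that is the lead byte 0xe3, a second
# byte in 0x81-0x83, and one more follower byte in 0x80-0xbf.
KANA = re.compile(rb"\xe3[\x81-\x83][\x80-\xbf]")


def count_kana(data: bytes) -> int:
    """Return the number of hiragana/katakana glyphs in utf-8 bytes."""
    return len(KANA.findall(data))


def looks_japanese(data: bytes, min_kana: int = 5) -> bool:
    """Guess whether the text is Japanese.

    min_kana is a hypothetical threshold; pick one that suits your mail.
    """
    return count_kana(data) >= min_kana


greeting = "\u3053\u3093\u306b\u3061\u306f".encode("utf-8")  # 5 hiragana
english = "It\u2019s a caf\u00e9 \u2013 really".encode("utf-8")
print(looks_japanese(greeting), looks_japanese(english))  # True False
```

Kanji are deliberately not counted, since they are shared with Chinese
text, and the smartquote and dash in the English sample do not match
because their lead byte is 0xe2, not 0xe3.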
____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)de
http://mailman.rwth-aachen.de/mailman/listinfo/procmail