Konstantin <klk206(_at_)panix(_dot_)com> wrote:
Hi,
How is it possible to detect (and filter) an email written in Japanese
characters (which I cannot read anyway)?
The Content-Type specifies charset="utf-8". The "From" field is apparently
invalid, and may not necessarily contain .jp
What I do is:
a) specify a list of charsets that I understand:
OK_CHARSET=(ASCII|DISPAY|ISO-8859-[12]|WINDOWS-125[012]|utf-8|utf8)
b) filter anything that (1) specifies charset, and (2) does -not- have
one of those charsets:
:0 H
* ^(From|To|Subject): *\=\?\/.*
* ! $ MATCH ?? ${OK_CHARSET}
$DISCARD
:0 H
* ^Content-Type:.*charset\/.*
* ! $ MATCH ?? ${OK_CHARSET}
$DISCARD
c) for 'foreign language' issues, I look for 'commonly occurring'
character-sequences (presumably 'words', but, since I don't speak
the language, I'm not -sure- of that:) and look for any of several
such (presumed) 'words' in the message body. e.g.:
for German (which I understand, a little):
:0 H
* ^Subject:.*\<(aufmachen|und|der|Ihr|Ihre|Viel|Zeit)\>
$DISCARD
for Italian (which I don't):
:0 EH
* ^Subject:.*\<(aviso|limitazione|posteitaliane|Urgente|attenzione|logiciel|prospection|de la)\>
$DISCARD
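For anyone who wants to try the charset test from step (b) outside of
procmail, here is a rough Python sketch of the same idea. It is not a
translation of the recipes: the function names are made up for illustration,
the allow-list is a trimmed copy of OK_CHARSET, and it is stricter than the
recipes in that it rejects the message if -any- declared charset is off the
list.

```python
import re

# Trimmed copy of the OK_CHARSET alternation. Like procmail, the match is
# case-insensitive and unanchored, so "us-ascii" is accepted via "ASCII".
OK_CHARSET = re.compile(r"ASCII|ISO-8859-[12]|WINDOWS-125[012]|utf-?8",
                        re.IGNORECASE)

# RFC 2047 encoded-word prefix "=?charset?B?" or "=?charset?Q?".
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?[BbQq]\?")
# MIME "charset=" parameter, with or without quotes.
CHARSET_PARAM = re.compile(r"charset\s*=\s*\"?([^\";\s]+)", re.IGNORECASE)


def declared_charsets(headers: str) -> list[str]:
    """Return every charset named in encoded-words or charset= parameters."""
    return ENCODED_WORD.findall(headers) + CHARSET_PARAM.findall(headers)


def should_discard(headers: str) -> bool:
    """True if some declared charset is not on the allow-list."""
    return any(not OK_CHARSET.search(cs) for cs in declared_charsets(headers))
```

A message that declares no charset at all passes, same as with the recipes.
Note that a utf-8 message in Japanese passes this test too, which is why
something more is needed for that case.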
Japanese (along with most other languages that do not use a 'latin'-
based character set), in utf-8, is going to have multiple multi-byte
'glyphs' in a single word. Detecting 'glyph' boundaries is a little
complex, but eminently 'doable'. All utf-8 multi-byte glyphs start
with a high-bit-set byte. If the first byte is '0xc2'-'0xdf', it is
a 2-byte glyph (with the 'common' extended characters in the '0xc2'
and '0xc3' sets). If it is '0xe0'-'0xef' it is a 3-byte glyph, and
'0xf0'-'0xf4' a 4-byte glyph ('0xf5'-'0xff' never occur in valid
utf-8). All the 'follower' bytes in the glyph are in the '0x80'-'0xbf'
range. The '0xc2' and '0xc3' sets include the most common 'non ASCII'
characters in most 'western' languages. *MOST* glyphs above the '0xc8'
range are either 'specialized use' ones, or 'non western' language
symbols.
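To make the boundary rules concrete, here is a small Python sketch that
classifies a lead byte and splits a byte string into per-glyph chunks. The
function names are hypothetical, the splitter assumes well-formed input, and
it caps 4-byte lead bytes at 0xf4, which is the highest lead byte valid
utf-8 allows (RFC 3629).

```python
def utf8_sequence_length(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with `lead`.

    Returns 0 if `lead` cannot start a valid sequence.
    """
    if lead <= 0x7F:
        return 1  # plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # 'follower' byte (0x80-0xBF) or a byte UTF-8 never uses


def split_glyphs(data: bytes) -> list[bytes]:
    """Split UTF-8 bytes into one byte string per character.

    Assumes well-formed input; stray bytes are stepped over one at a time.
    """
    out, i = [], 0
    while i < len(data):
        n = utf8_sequence_length(data[i]) or 1
        out.append(data[i:i + n])
        i += n
    return out
```

For example, split_glyphs() on the utf-8 encoding of a string returns the
same pieces as encoding each character separately.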
Note: you cannot 'safely' drop 'anything' with such a glyph in it
since Microsoft products routinely use several 3-byte glyphs --
things like 'smartquotes', dashes, etc. (*snarl*)
Your best bet is to look for 'commonly occurring' glyph sequences in the
Japanese utf-8 text, and filter on those sequences.
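One way to sketch that idea, in Python rather than procmail: nearly all
Japanese text contains hiragana or katakana (U+3040-U+30FF), and in utf-8
those always encode as 0xe3 followed by 0x81-0x83 and one more follower
byte. Smartquotes and dashes start with 0xe2 instead, so they are not
counted. The function name and the threshold of 5 are arbitrary choices for
illustration, and the body is assumed to be raw 8-bit utf-8 (a base64 or
quoted-printable body would have to be decoded first).

```python
import re

# Hiragana (U+3040-U+309F) and katakana (U+30A0-U+30FF) in UTF-8:
# lead byte 0xE3, second byte 0x81-0x83, then one more continuation byte.
KANA = re.compile(rb"\xe3[\x81-\x83][\x80-\xbf]")


def looks_japanese(body: bytes, threshold: int = 5) -> bool:
    """True if the raw UTF-8 body contains at least `threshold` kana."""
    return len(KANA.findall(body)) >= threshold
```

A threshold above 1 keeps a single stray kana (in a signature, say) from
triggering the filter; tune it against your own mail.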
____________________________________________________________
procmail mailing list Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)de
http://mailman.rwth-aachen.de/mailman/listinfo/procmail