At 12:01 2006-03-02 -0800, Komal Tagdiwala -X (ktagdiwa - Saama
Technologies at Cisco) wrote:
It has been relatively easy for me to filter out non-English emails
(as spam) using procmail by checking for character sets when the mailbox
is expecting only English-language emails.
I now need to filter emails for individual languages like Chinese,
Japanese, Korean, etc., where the mailbox can receive emails in
non-English character sets. Obviously, the character-set-based
filtering approach won't help me in this requirement to filter
Er, please define "character set based filtering". I suspect you're simply
filtering on 8-bit text or somesuch and considering it non-English (which
isn't wholly correct anyway). Many different languages utilize distinctly
different character sets - if you merely look at the message body and flag
some given high-bit character as meaning foreign, you're not going to know
what language unless you refer to the headers.
Have you reviewed my "furrin.rc" script? See the link in my sigline. It
contains a host of reference URLs where you might find other suitable
information, and of course a large collection of identified character sets,
grouped into general language territories.
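
To illustrate the point about referring to the headers, here is a minimal Python sketch (not procmail, and not taken from furrin.rc) that reads the declared charset from a message's Content-Type header and maps it to a language territory. The function name `territory_from_headers` and the tiny `CHARSET_TERRITORY` table are assumptions made up for this example; a real table would be far larger.

```python
from email import message_from_string

# Hypothetical, deliberately tiny charset -> territory table (illustration only).
CHARSET_TERRITORY = {
    "gb2312": "Chinese",
    "big5": "Chinese",
    "iso-2022-jp": "Japanese",
    "shift_jis": "Japanese",
    "euc-jp": "Japanese",
    "euc-kr": "Korean",
}


def territory_from_headers(raw_message):
    """Return the language territory implied by the declared charset, or None."""
    msg = message_from_string(raw_message)
    # get_content_charset() returns the charset parameter lowercased,
    # or None when the header or the parameter is missing.
    charset = msg.get_content_charset()
    if charset is None:
        return None
    return CHARSET_TERRITORY.get(charset)


sample = (
    "From: someone@example.com\n"
    "Subject: test\n"
    'Content-Type: text/plain; charset="ISO-2022-JP"\n'
    "\n"
    "body\n"
)

print(territory_from_headers(sample))
```

Note that a multi-language charset such as UTF-8 is not in the table and yields None: the header alone says nothing about the language in that case.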
1. If I were to create separate recipe files for each language (example:
rc.spam_china, rc.spam_japan, ...), where each recipe has filters for
that specific language, are there any specific settings, configuration
options, or flags that need to be set in procmail so that
procmail matches the words listed in the language-specific filters
Procmail has no understanding of foreign character sets beyond telling it
to match for a header (like in furrin.rc), and then matching for some other
What I mean here is that the words to be filtered in each
language-specific recipe are going to be in that language (non-English
characters). Will procmail be able to truthfully interpret those words
in that specific language "as-is" or would procmail interpret them as
ASCII character equivalents/junk characters if the host where procmail
is running does not understand that language (Japanese, Chinese, etc.)?
The latter. They're a series of characters, and if that series of
characters matches in the message, it's a match. Now, if you ADDITIONALLY
have a criterion that the message is in some specific character set
encoding, then you could identify that the string is indeed that word in
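
As a rough illustration of why the character set condition matters, the Python sketch below (again, not procmail) shows that one word becomes different byte sequences under different encodings, so a byte-level match only means something when paired with the declared charset. The helper `matches` and the sample word are assumptions chosen for this example.

```python
# Arbitrary two-character Chinese word used purely as sample data.
WORD = "\u53d1\u7968"


def matches(body_bytes, declared_charset, word, wanted_charset):
    """True only if the charset is the one we expect AND the word's bytes occur."""
    if declared_charset.lower() != wanted_charset.lower():
        return False
    # Encode the word the same way the message claims to be encoded,
    # then do a plain byte-sequence search, which is all a matcher sees.
    return word.encode(wanted_charset) in body_bytes


gb_body = ("xx" + WORD + "yy").encode("gb2312")
utf8_body = ("xx" + WORD + "yy").encode("utf-8")

# The same word yields different bytes in the two encodings.
print(WORD.encode("gb2312") != WORD.encode("utf-8"))
print(matches(gb_body, "GB2312", WORD, "gb2312"))
print(matches(utf8_body, "utf-8", WORD, "gb2312"))
```

A filter keyed to the GB2312 bytes would therefore miss the same word sent as UTF-8, which is why each language-specific recipe would need one pattern per encoding it expects to see.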
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.