
Re: Filtering spam for non-English languages like Chinese, Japanese, Korean

2006-03-02 14:29:09
At 12:01 2006-03-02 -0800, Komal Tagdiwala -X (ktagdiwa - Saama 
Technologies at Cisco) wrote:
> Hello!
>
> It has been relatively easy for me to filter out non-English emails
> (as spam) using procmail by checking for character sets when the mailbox
> is expecting only English-language emails.
>
> I now need to filter emails for individual languages like Chinese,
> Japanese, Korean, etc., where the mailbox can receive emails in a
> non-English character set. Obviously, the character-set-based
> filtering approach won't help me with this requirement to filter
> language-specific emails.

Er, please define "character set based filtering".  I suspect you're simply 
filtering on 8-bit text or some such and treating it as non-English (which 
isn't wholly correct anyway).  Different languages use distinctly different 
character sets - if you merely look at the message body and flag any 
high-bit character as meaning foreign, you won't know which language it is 
unless you refer to the headers.

Have you reviewed my "furrin.rc" script?  See the link in my sigline.  It 
contains a host of reference URLs where you might find other suitable 
information, and of course a large collection of identified character sets, 
grouped into general language territories.

> Questions:
> 1. If I were to create separate recipe files for each language (example:
> rc.spam_china, rc.spam_japan, ...), where each recipe has filters for
> that specific language, are there any specific
> settings/configuration/flags that need to be set in procmail so that
> procmail matches the words listed in the language-specific filters
> correctly?

Procmail has no understanding of foreign character sets.  The most you can 
do is tell it to match on a header (as furrin.rc does), and then match on 
some other ASCII text.
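
As a rough sketch of that header-then-text approach: this assumes the 
per-language files named in the question live under a hypothetical 
$HOME/.procmail directory, and that the charset is declared in the 
top-level Content-Type header.  The charset names are common ones for each 
language, not an exhaustive set and not taken from furrin.rc.

```procmail
# Header conditions are regular expressions, case-insensitive by default.
# Japanese: hand off to the (hypothetical) rc.spam_japan for further tests.
:0
* ^Content-Type:.*charset="?(iso-2022-jp|shift_jis|euc-jp)
{
  INCLUDERC=$HOME/.procmail/rc.spam_japan
}

# Chinese: same idea, different charset names and rcfile.
:0
* ^Content-Type:.*charset="?(gb2312|gbk|big5)
{
  INCLUDERC=$HOME/.procmail/rc.spam_china
}
```

No special flag is involved here; the language is selected purely by the 
header match.  A multipart message declares the charset inside each body 
part rather than in the top header, so this sketch would not catch those.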

> What I mean here is that the words to be filtered in each
> language-specific recipe are going to be in that language (non-English
> characters). Will procmail be able to faithfully interpret those words
> in that specific language "as-is", or would procmail interpret them as
> ASCII character equivalents/junk characters if the host where procmail
> is running does not understand that language (Japanese, Chinese, etc.)?

The latter.  To procmail they're just a series of characters (bytes), and 
if that series appears in the message, it's a match.  Now, if you 
ADDITIONALLY have a criterion that the message is in some specific 
character set encoding, then you could establish that the string is indeed 
that word in that language.
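
A minimal sketch of combining the two criteria: this assumes the body is 
sent as raw 8-bit EUC-KR (not base64 or quoted-printable, which procmail 
does not decode), and that the rcfile itself is saved in EUC-KR so the 
pattern holds the same bytes as the mail.  The word and the spam-korean 
folder name are placeholders.

```procmail
# Outer condition: the header says the message is EUC-KR.
:0
* ^Content-Type:.*charset="?euc-kr
{
  # B = test the condition against the body instead of the header.
  # The pattern is compared as bytes; procmail does not know it is Korean.
  :0 B:
  * 무료
  spam-korean
}
```

The same bytes in a message declared as some other charset would not reach 
the inner recipe, which is what ties the match to the language.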

---
  Sean B. Straw / Professional Software Engineering

  Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
  Please DO NOT carbon me on list replies.  I'll get my copy from the list.


____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
