procmail

RE: Filtering spam for non-English languages like Chinese, Japanese,Korean

2006-03-02 14:38:03
Hi Sean!

Thanks for the prompt response.  I use the following condition to do
character-set-based filtering (looking at the headers):

:0
* ^Content-Type:.*(gb2312|big5|euc-cn|hz-gb-2312|x-mac-chinesesimp|cp-936|x-mac-chinesetrad|cp-950|cp-932|euc-tw)
{
        # Recipe to capture spam for Chinese emails
        # 1. Search Subject and Body for specific words for tagging as
        #    spam [THIS_IS_SPAM_EMAIL]
}

Likewise, I check for other character sets for Japanese and Korean.
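
As a rough illustration of the tagging step described in the recipe comments, the inside of such a block might look like the sketch below. KEYWORD_GOES_HERE is a placeholder for a word stored in the rc file in the same encoding as the mail, SUBJECT is a variable name chosen here for illustration, the shortened charset list is only an example, and formail is assumed to be on the PATH.

```procmail
# Sketch only: keyword, variable name, and charset list are assumptions.
:0
* ^Content-Type:.*charset="?(gb2312|big5|euc-cn|hz-gb-2312|euc-tw)
{
        # Save the current Subject text; \/ stores what follows it in $MATCH
        :0
        * ^Subject:[ ]*\/[^ ].*
        { SUBJECT=$MATCH }

        # H B: search both the headers and the body for the keyword
        :0 HB
        * KEYWORD_GOES_HERE
        {
                # f h w: filter only the header through formail and wait for it;
                # -I replaces the existing Subject field with the tagged one
                :0 fhw
                | formail -I "Subject: [THIS_IS_SPAM_EMAIL] $SUBJECT"
        }
}
```

Because of the h flag, only the header is rewritten and the body passes through unchanged.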

- Komal

-----Original Message-----
From: procmail-bounces(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
[mailto:procmail-bounces(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE] On Behalf Of Professional Software Engineering
Sent: Thursday, March 02, 2006 1:10 PM
To: procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
Subject: Re: Filtering spam for non-English languages like Chinese, Japanese,Korean

At 12:01 2006-03-02 -0800, Komal Tagdiwala -X (ktagdiwa - Saama
Technologies at Cisco) wrote:
Hello!

It has been relatively easy for me to filter out non-English emails
(as spam) using procmail by checking for character sets when the
mailbox is expecting only English-language emails.

I now need to filter emails for individual languages like Chinese,
Japanese, Korean, etc., where the mailbox can receive email in a
non-English character set. Obviously, the character-set-based
filtering approach won't help me with this requirement to filter
language-specific emails.

Er, please define "character set based filtering".  I suspect you're
simply filtering on 8-bit text or somesuch and considering it
non-English (which isn't wholly correct anyway).  Many different
languages utilize distinctly different character sets - if you merely
look at the message body and flag some given high-bit character as meaning
foreign, you're not going to know what language unless you refer to the
headers.

Have you reviewed my "furrin.rc" script?  See the link in my sigline.
This contains a host of reference URLs where you might find other suitable
information, and of course a large distribution of identified
character sets, grouped into general language territories.

Questions:
1. If I were to create separate recipe files for each language
(example: rc.spam_china, rc.spam_japan, ...), where each recipe has
filters for that specific language, is there any specific
setting/configuration/flag that needs to be set in procmail so that
procmail matches the words listed in the language-specific filters
correctly?

Procmail has no understanding of foreign character sets beyond telling
it to match for a header (like in furrin.rc), and then matching for some
other ascii text.
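
One way to arrange the per-language files from question 1 is to test the charset header first and only then pull in the matching rc file. This is a sketch under assumptions: the $HOME/.procmail directory is hypothetical, the charset lists are partial examples rather than the contents of furrin.rc, and the file names are the ones proposed in the question plus an invented rc.spam_korea.

```procmail
# Sketch only: paths and charset lists are assumptions.
# Each block includes a language-specific rc file only when the
# Content-Type header names a charset associated with that language.

:0
* ^Content-Type:.*charset="?(gb2312|big5|euc-cn|hz-gb-2312|euc-tw)
{ INCLUDERC=$HOME/.procmail/rc.spam_china }

:0
* ^Content-Type:.*charset="?(iso-2022-jp|shift_jis|euc-jp)
{ INCLUDERC=$HOME/.procmail/rc.spam_japan }

:0
* ^Content-Type:.*charset="?(euc-kr|iso-2022-kr|ks_c_5601-1987)
{ INCLUDERC=$HOME/.procmail/rc.spam_korea }
```

No special procmail setting is involved; the header condition is what decides which word list gets applied.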

What I mean here is that the words to be filtered in each
language-specific recipe are going to be in that language (non-English
characters). Will procmail be able to faithfully interpret those words
in that specific language "as-is", or would procmail interpret them as
ASCII character equivalents/junk characters if the host where procmail
is running does not understand that language (Japanese, Chinese, etc.)?

The latter.  They're a series of characters, and if that series of
characters matches in the message, it's a match.  Now, if you
ADDITIONALLY have a criterion that the message is in some specific
character set encoding, then you could identify that the string is
indeed that word in that language.
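
The combination described above (a charset criterion plus a raw character sequence) could be sketched as two conditions on one recipe, since procmail ANDs all conditions of a recipe. WORD_AS_EUC_KR_BYTES is a placeholder, not a real keyword, and the spam-korean mailbox name is hypothetical.

```procmail
# Sketch only. Replace WORD_AS_EUC_KR_BYTES with the keyword, saved in this
# rc file as raw EUC-KR bytes so it is byte-identical to what the mail contains.
:0 :
* ^Content-Type:.*charset="?euc-kr
* B ?? WORD_AS_EUC_KR_BYTES
spam-korean
```

Note that this only matches when the body arrives unencoded; a base64 or quoted-printable body would have to be decoded before the byte sequence could be found.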

---
  Sean B. Straw / Professional Software Engineering

  Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
  Please DO NOT carbon me on list replies.  I'll get my copy from the list.


____________________________________________________________
procmail mailing list   Procmail homepage: http://www.procmail.org/
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail

