procmail

RE: Something like ereg_replace?

2003-07-05 16:31:06
To add to my own answer:
There are of course plenty more ways to make my work much harder, say
something easy like this:
PO<font color="white">craptext</font>R<font
color="white">craptext</font>N.

That's a harder problem to solve, and pattern matching then starts to
become inefficient for spam filtering. But that's quite off topic and I
do not know enough about the subject.
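
To make the example above concrete, here is a rough sketch of what one narrow pattern for it could look like. It assumes the hidden text is always wrapped in exactly <font color="white">...</font>; the function name is made up for illustration, and any variation in the markup (another colour name, a hex value, nested tags) slips straight past it, which is rather the point.

```shell
#!/bin/sh
# Sketch only: drop <font color="white">...</font> runs, hidden text included.
# Handles just this one exact spelling of the trick; anything else gets through.
strip_white_font() {
    # tr joins wrapped lines so a tag split across a line break still matches
    tr '\n' ' ' | sed -e 's/<font color="white">[^<]*<\/font>//g'
}

printf '%s\n' 'PO<font color="white">craptext</font>R<font' \
    'color="white">craptext</font>N.' | strip_white_font
echo
```

Joining the lines first is a shortcut that only suits a throwaway copy used for matching, never the message that gets delivered.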

Regards,
Björn

-----Original Message-----
From: Björn Lilja [mailto:bjorn(_at_)lirasko(_dot_)se] 
Sent: Saturday, July 05, 2003 4:06 PM
To: 'procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE'
Subject: RE: Something like ereg_replace?


Hi,

OK, I accept the answer that there is no way to
preparse text within procmail, which was my original
question. Thanks for your answers!

I do not need to differentiate between correct HTML and made-up
constructs; sorry for being unclear there. My goal is to
first filter out _enough_ of everything that is not actual text
and then do the analysis from there. That would drastically
increase my filtering performance, since most spam I get uses cheap
tricks like s<ds>lu<dsfds>ts. Naturally, I would then also
have to filter the message without removing the tags, and
then I would catch the rare cases when someone writes <porn>
like that and it is actually readable in the message.

And again, the mail I actually read should be unaltered. The
replace function would only put a temporary version of
the message into a variable for the spam filter to work on.
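
As a minimal sketch of the kind of temporary copy I mean, assuming an external sed is acceptable and that "anything shaped like <...>" is a good enough definition of a tag (the function name strip_tags is only for illustration):

```shell
#!/bin/sh
# Sketch only: print a copy of stdin with anything shaped like <...> removed.
# The original message is never modified; only this copy is meant for matching.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

printf '%s\n' 's<ds>lu<dsfds>ts' | strip_tags
```

The filter would then match against this output instead of the raw body, while the delivered message stays as it arrived.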

Or am I missing the entire point of yours?

Regards,
Björn

No.  What about legitimately quoted HTML such as:

         Oh, you should parse for the <IMG> tag

Smileys and other made-up constructs:

         <g>
         <Rod Serling voiceover>

         etc.

or such tags which span quoted message comments (which commonly start lines
with ">", which would close your HTML)

or math/code:

         # operate only on messages less than 25,000 bytes in size
         :0
         * < 25000

or intentionally bracketed URL references (i.e. the message isn't HTML,
but users often quote very long URLs with brackets just to encapsulate them
across linebreaks when you're using a smart email client).
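
To illustrate the cases listed above, here is a rough sketch that runs a naive one-line tag stripper over text which is not HTML at all. The stripper itself is an assumption about what such a filter might look like, and the URL is a placeholder.

```shell
#!/bin/sh
# Sketch only: a naive stripper, run on ordinary plain-text mail content.
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

# Each line loses something a human reader needed.
strip_tags <<'EOF'
Oh, you should parse for the <IMG> tag
That was a joke <g>
see <http://example.invalid/a/very/long/path> for details
if a < b and c > d
EOF
```

The tag name, the smiley, the URL and half of the comparison all vanish from the copy, so anything scored on that copy is scoring different text from what the sender wrote.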

 I do not want to change the content of the e-mail, just preparse it into a
variable so I can do more accurate filtering. In, say, Perl or PHP this would
definitely not be a problem, and I take it that the eregs work the same?

If it isn't a problem in Perl, then your best bet is to implement it in
Perl and call your Perl program from procmail.  Problem solved.  Provided
it's really as easy as you think it is.  I say it isn't.

If you simply want to remove HTML constructs, then you'll need to worry
about which messages actually claim to be HTML, and those which contain HTML
by reference (such as a technical list).  Multipart messages will also pose
a special grief to you.

OK, so there is basically no ereg/replace function within
procmail then?

Procmail has absolutely *NO* replace functionality.  Even to change a
header, you must call formail.  To delete a line from the body, most people
invoke sed, etc.
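
For the body case, a minimal sketch of the sort of sed call meant here, assuming it is handed the whole message (header, blank line, body) on stdin. The function name and the sample line are hypothetical, and the header case with formail is not shown.

```shell
#!/bin/sh
# Sketch only: delete matching lines from the body, leaving the header alone.
# "1,/^$/" is the header (everything up to the first blank line); "!" negates it.
# $1 is pasted into a sed address, so it must not contain an unescaped "/".
delete_body_line() {
    sed -e '1,/^$/!{' -e "/$1/d" -e '}'
}

delete_body_line '^Sent from my hypothetical client$' <<'EOF'
Subject: test

real text
Sent from my hypothetical client
EOF
```

If the caller passes only the body, the header guard is unnecessary and a plain /pattern/d is enough.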

that 1) I should be able to receive e-mail from people interested in my
business or other area even if they are not on my nobounce/whitelist (I
have one as well) and 2) many people unfortunately write their
e-mails in HTML by default, and the risk that someone not on the list
sends me a legitimate e-mail like that is just too high.

You might simply employ a comment filter - something that tags HTML
messages which contain an excess of HTML comment tags.  Additionally, you
could search for some characteristic tags used in HTML spam, but which
generally are NOT part of legit communications - webforms for instance.

I also flag messages which ONLY contain an HTML attachment, but no plaintext,
as I do plaintext which STARTS with an <HTML opening tag.
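
A rough sketch of the comment-count and form-tag checks, as helpers an external filter script might use. The function names and the cutoff of 5 are assumptions for illustration, not tested values.

```shell
#!/bin/sh
# Sketch only: count HTML comment openers on stdin. grep -o is a common
# extension (GNU and BSD grep), not plain POSIX.
count_html_comments() {
    grep -o '<!--' | wc -l | tr -d ' '
}

# THRESHOLD is an arbitrary, assumed cutoff; tune it against real mail.
THRESHOLD=5

# Succeeds when stdin holds more than THRESHOLD comment openers.
has_excess_comments() {
    [ "$(count_html_comments)" -gt "$THRESHOLD" ]
}

# A form tag is one example of markup that is rare in person-to-person mail.
has_form_tag() {
    grep -qi '<form'
}
```

A message would be fed to each helper on stdin, and the exit status used to decide whether to tag it.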

If you're concerned about possibly missing messages which are legit,
consider adding a filter which pulls messages to the side based on a
preponderance of terms related to YOUR business - product names, tradeshows
which you attend, etc, then let those coast through.
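
One possible shape for such a term count, as a sketch only. The terms below are invented placeholders, and what count amounts to a "preponderance" is left for you to decide.

```shell
#!/bin/sh
# Sketch only: count occurrences of terms tied to your own business.
# TERMS is a made-up list; substitute real product and tradeshow names.
TERMS='widgetmaster|examplecon|acme gasket'

# Prints the number of case-insensitive matches found on stdin.
count_business_terms() {
    grep -i -o -E "$TERMS" | wc -l | tr -d ' '
}
```

Messages scoring above whatever cutoff you choose could then be set aside before the stricter HTML checks run.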

---
  Sean B. Straw / Professional Software Engineering

  Procmail disclaimer:
<http://www.professional.org/procmail/disclaimer.html>
  
Please DO NOT carbon me on list replies.  I'll
get my copy from the list.



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail
