
RE: Something like ereg_replace?

2003-07-05 16:15:03
Hi,

OK, I accept the answer that there is no way to preparse text
within procmail, which was my original question. Thanks for your
answers!

I do not need to differentiate between correct HTML and made-up
constructs; sorry for being unclear there. My goal is to first filter
out _enough_ of everything that is not actual text and then do the
analysis from there. That would drastically improve my performance,
since most spam I get uses cheap tricks like s<ds>lu<dsfds>ts.
Naturally, I would then also have to filter the message without
removing the tags, so that I catch the rare cases where someone writes
<porn> like that and it is actually readable in the message.

And again, the mail I actually read should be unaltered. The replace
function would only put a temporary version of the message into a
variable for the spam filter to work on.
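
To make the idea concrete, here is a minimal sketch in Python (not procmail, which cannot rewrite text). The names `strip_bracketed` and `contains_term` and the regular expression are assumptions made up for illustration. The sketch checks a term against a stripped temporary copy and also against the unaltered text, so both the s<ds>lu<dsfds>ts case and the readable <porn> case are caught, while the original string is never modified.

```python
import re

# Assumed, deliberately crude pattern: any run of "<...>" with no nested
# angle brackets. This is not an HTML parser.
_TAG_RE = re.compile(r"<[^<>]*>")


def strip_bracketed(text):
    """Return a temporary copy of text with every <...> run removed."""
    return _TAG_RE.sub("", text)


def contains_term(text, term):
    """Look for term in the stripped copy and in the unaltered text."""
    lowered = text.lower()
    return term in strip_bracketed(lowered) or term in lowered


message = "buy s<ds>lu<dsfds>ts now"
hit = contains_term(message, "sluts")  # found only in the stripped copy
```

The message itself stays as delivered; only the copy produced by `strip_bracketed` is handed to the checks.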

Or am I missing your point entirely?

Regards,
Björn

No.  What about legitimate quoted HTML such as:

         Oh, you should parse for the <IMG> tag

Smileys and other made-up constructs:

         <g>
         <Rod Serling voiceover>

         etc.

or such tags which span quoted message comments (which commonly start
lines with ">", which would close your HTML)

or math/code:

         # operate only on messages less than 25,000 bytes in size
         :0
         * < 25000

or intentionally bracketed URL references (i.e. the message isn't HTML,
but users often quote very long URLs with brackets just to encapsulate
them across linebreaks when you're using a smart email client).
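
The cases listed above can be tried against a naive "remove everything between < and >" pattern. The following is only a sketch: the function name `naive_strip`, the pattern, and the example.com URL are assumptions for illustration, not anything from an actual filter.

```python
import re

# Assumed naive pattern: "<", then anything except angle brackets
# (newlines included), then ">".
_NAIVE_TAG_RE = re.compile(r"<[^<>]*>")


def naive_strip(text):
    """Remove every <...> run, whether or not it is really HTML."""
    return _NAIVE_TAG_RE.sub("", text)


# Quoted HTML in a technical discussion loses the tag being discussed.
quoted_html = naive_strip("Oh, you should parse for the <IMG> tag")

# A made-up construct disappears completely.
smiley = naive_strip("<Rod Serling voiceover>")

# A recipe condition followed by a ">" quoted line: the "<" on one line
# pairs with the quote marker on the next, and the size is eaten.
recipe = naive_strip("* < 25000\n> quoted reply")

# A bracketed URL vanishes as well.
url = naive_strip("<http://example.com/a/long/path>")
```

In each case the stripped copy no longer says what the sender wrote, which is the risk being described.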

I do not want to change the content of the e-mail, just preparse it
into a variable so I can do more accurate filtering. In, say, Perl or
PHP this would definitely not be a problem, and I take it that the
eregs work the same?

If it isn't a problem in Perl, then your best bet is to implement it in
Perl and call your Perl program from procmail. Problem solved.  Provided
it's really as easy as you think it is.  I say it isn't.

If you simply want to remove HTML constructs, then you'll need to worry
about which messages actually claim to be HTML, and those which contain
HTML by reference (such as on a technical list).  Multipart messages
will also cause you special grief.

Ok, so there is basically no ereg/replace function within the procmail
functionality then?

Procmail has absolutely *NO* replace functionality.  Even to change a
header, you must call formail.  To delete a line from the body, most
people invoke sed, etc.

that 1) I should be able to receive e-mail from people interested in my
business or other area even if they are not on my nobounce/whitelist (I
have one as well) and 2) Many people do unfortunately write their
e-mails in HTML by default, and the risk that someone not on the list
sends me a legitimate e-mail like that is just too high.

You might simply employ a comment filter - something that tags HTML
messages which contain an excess of HTML comment tags.  Additionally,
you could search for some characteristic tags used in HTML spam, but
which generally are NOT part of legit communications - webforms for
instance.
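
A rough sketch of such a comment filter, written in Python as an external helper. The function name `html_spam_signals`, the default threshold of 5 comments, and the choice of `<form>`/`<input>` as the "webform" tags are all assumptions for illustration; sensible values would have to come from your own mail.

```python
import re

# HTML comments, allowing them to span lines.
_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Opening tags typical of webforms (assumed list, extend as needed).
_FORM_RE = re.compile(r"<\s*(form|input)\b", re.IGNORECASE)


def html_spam_signals(body, max_comments=5):
    """Count HTML comments and note whether webform tags are present."""
    comments = len(_COMMENT_RE.findall(body))
    return {
        "comments": comments,
        "excess_comments": comments > max_comments,
        "has_form": bool(_FORM_RE.search(body)),
    }


signals = html_spam_signals("V<!-- a -->i<!-- b -->a", max_comments=1)
```

The result only tags the message; what to do with a tagged message is left to whatever calls the helper.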

I also flag messages which ONLY contain an HTML attachment but no
plaintext, as I do plaintext which STARTS with an <HTML opening tag.
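
Those two checks could look roughly like this in Python, using the standard `email` package. The function name `looks_html_only` is made up for illustration, and the sketch is simplified: it does not decode base64 or quoted-printable bodies before looking at them.

```python
from email import message_from_string


def looks_html_only(raw):
    """Flag HTML-without-plaintext mail and plaintext starting with <HTML."""
    msg = message_from_string(raw)
    leaf_parts = [part for part in msg.walk() if not part.is_multipart()]
    types = [part.get_content_type() for part in leaf_parts]

    # Case 1: there is an HTML part but no plaintext part at all.
    if "text/html" in types and "text/plain" not in types:
        return True

    # Case 2: a part labelled (or defaulting to) text/plain whose text
    # begins with an <HTML opening tag.
    for part in leaf_parts:
        if part.get_content_type() == "text/plain":
            payload = part.get_payload()
            if isinstance(payload, str) and payload.lstrip().lower().startswith("<html"):
                return True
    return False


flagged = looks_html_only("Subject: x\n\n<HTML><BODY>buy</BODY></HTML>\n")
```

A multipart message that carries both a plaintext and an HTML alternative is not flagged by either check.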

If you're concerned about possibly missing messages which are legit,
consider adding a filter which pulls messages to the side based on a
preponderance of terms related to YOUR business - product names,
tradeshows which you attend, etc, then let those coast through.
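
As a sketch of that last idea in Python: count how many of your own terms appear and let the message through when enough of them do. The function names, the example terms "WidgetPro" and "ExampleExpo", and the threshold of two hits are placeholders invented for illustration.

```python
def count_business_terms(text, terms):
    """Count how many distinct terms occur in text, ignoring case."""
    lowered = text.lower()
    return sum(1 for term in terms if term.lower() in lowered)


def let_through(text, terms, min_hits=2):
    """True when the message mentions at least min_hits of the terms."""
    return count_business_terms(text, terms) >= min_hits


# Placeholder product and tradeshow names; substitute your own.
MY_TERMS = ["WidgetPro", "ExampleExpo", "service contract"]

mail = "We met at ExampleExpo and I would like a quote for WidgetPro."
passes = let_through(mail, MY_TERMS)
```

Messages that pass would skip the HTML checks entirely, which limits the damage when a legitimate sender writes in HTML.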

---
  Sean B. Straw / Professional Software Engineering

  Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
  
Please DO NOT carbon me on list replies.  I'll get my copy from the list.



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE 
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail



