RE: Something like ereg_replace?
2003-07-05 16:31:06
To add to my own answer:
There are of course plenty more ways to make my work much harder, say
something easy like this:
PO<font color="white">craptext</font>R<font
color="white">craptext</font>N.
That's a harder problem to solve, and pattern matching then starts to
become inefficient for spam filtering. But that's quite off topic, and I
do not know enough about the subject.
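To make the difficulty concrete, here is a rough sketch in Python (an
assumed stand-in for whatever external helper ends up doing the work; the
sample string and function names are made up for illustration). Deleting
only the tags leaves the invisible filler behind, so the helper also has
to drop the contents of the white-font elements. It only recognizes this
one exact markup, and text can be hidden in many other ways.

```python
import re

# Assumption: hidden filler is wrapped in <font color="white">...</font>.
HIDDEN_FONT = re.compile(
    r'<font\s+color\s*=\s*"?white"?\s*>.*?</font>',
    re.IGNORECASE | re.DOTALL,
)
ANY_TAG = re.compile(r"<[^>]*>")


def strip_tags(html):
    """Remove every <...> run but keep the text between tags."""
    return ANY_TAG.sub("", html)


def visible_text(html):
    """Drop white-font elements entirely, then remove remaining tags."""
    return strip_tags(HIDDEN_FONT.sub("", html))


sample = 'LO<font color="white">xyzzy</font>A<font\ncolor="white">xyzzy</font>NS'
print(strip_tags(sample))    # LOxyzzyAxyzzyNS (filler survives)
print(visible_text(sample))  # LOANS
```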
Regards,
Björn
-----Original Message-----
From: Björn Lilja [mailto:bjorn(_at_)lirasko(_dot_)se]
Sent: Saturday, July 05, 2003 4:06 PM
To: 'procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE'
Subject: RE: Something like ereg_replace?
Hi,
OK, I accept the answer that there is no way to
pre-parse text within procmail, which was my original
question. Thanks for your answers!
I do not need to differentiate between correct HTML and
made-up constructs; sorry for being unclear there. My goal is
to first filter out _enough_ of everything that is not actual
text and then do the analysis from there. That would
drastically increase my performance, since most spam I get
uses cheap tricks like s<ds>lu<dsfds>ts. Naturally, I would
then also have to filter the message without removing the
tags, so that I catch the rare cases where someone writes
<porn> like that and it is actually readable in the message.
And again, the stuff I actually read should be unaltered. The
replace function would only put a temporary version of the
message into a variable for the spam filter to work on.
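A minimal sketch of that idea, assuming an external Python helper (the
word list and function names are hypothetical, and Python is only one
possible choice of language). The original body is never modified: a
stripped copy is built just for matching, and the raw body is checked as
well so that words written inside angle brackets are still seen.

```python
import re

ANY_TAG = re.compile(r"<[^>]*>")
# Hypothetical word list; a real filter would supply its own.
SUSPECT_WORDS = ("cheap loans",)


def strip_tags(text):
    """Return a temporary copy of text with every <...> run removed."""
    return ANY_TAG.sub("", text)


def looks_like_spam(body):
    """Match suspect words against both the stripped copy and the raw body."""
    stripped = strip_tags(body).lower()
    raw = body.lower()
    return any(word in stripped or word in raw for word in SUSPECT_WORDS)


print(looks_like_spam("che<ds>ap lo<dsfds>ans"))  # True (found after stripping)
print(looks_like_spam("<cheap loans>"))           # True (found in the raw body)
print(looks_like_spam("See you at lunch"))        # False
```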
Or am I missing the entire point of yours?
Regards,
Björn
No. What about legitimate quoted HTML such as:
Oh, you should parse for the <IMG> tag
Smileys and other made-up constructs:
<g>
<Rod Serling voiceover>
etc.
or such tags which span quoted message comments (which commonly
start lines with ">", which would close your HTML)
or math/code:
# operate only on messages less than 25,000 bytes in size
:0
* < 25000
or, intentionally bracketed URL references (i.e. the message isn't
HTML, but users often quote very long URLs with brackets just to
encapsulate them across linebreaks when you're using a smart email
client).
I do not want to change the content of the e-mail, just pre-parse it
into a variable so I can do more accurate filtering. In, say, Perl or
PHP this would definitely not be a problem, and I take it that the
eregs work the same?
If it isn't a problem in Perl, then your best bet is to implement it
in Perl and call your Perl program from procmail. Problem solved.
Provided it's really as easy as you think it is. I say it isn't.
If you simply want to remove HTML constructs, then you'll need to
worry about which messages actually claim to be HTML, and those which
contain HTML by reference (such as a technical list). Multipart
messages will also cause you special grief.
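As a hedged illustration of the "claims to be HTML" distinction, the
sketch below uses Python's standard email package to pull out only the
parts whose Content-Type is text/html, including parts nested inside a
multipart message. The function name and the fallback charset are
assumptions made for the example, and an unrecognized charset name in a
message is not handled.

```python
from email import message_from_string


def html_parts(raw_message):
    """Return the decoded text of every part that declares itself text/html."""
    msg = message_from_string(raw_message)
    found = []
    for part in msg.walk():  # visits the message itself and all nested parts
        if part.get_content_type() != "text/html":
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer-decoded
        # Assumption: fall back to latin-1 when no charset is declared.
        charset = part.get_content_charset() or "latin-1"
        found.append(payload.decode(charset, errors="replace"))
    return found
```

A plain-text message that merely mentions a tag such as <IMG> yields an
empty list, so it can be left alone by any tag-stripping step.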
OK, so there is basically no ereg/replace function within the
procmail functionality then?
Procmail has absolutely *NO* replace functionality. Even to change a
header, you must call formail. To delete a line from the body, most
people invoke sed, etc.
that 1) I should be able to receive e-mail from people interested in
my business or other area even if they are not on my
nobounce/whitelist (I have one as well) and 2) Many people do
unfortunately write their e-mails in HTML by default, and the risk
that someone not on the list sends me a legitimate e-mail like that
is just too high.
You might simply employ a comment filter - something that tags HTML
messages which contain an excess of HTML comment tags. Additionally,
you could search for some characteristic tags used in HTML spam, but
which generally are NOT part of legit communications - webforms for
instance.
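One way such a comment filter might look, sketched in Python under stated
assumptions: the threshold of five comments is an arbitrary placeholder
to be tuned against real mail, and the function and constant names are
made up for the example.

```python
import re

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
FORM_TAG = re.compile(r"<\s*form\b", re.IGNORECASE)
MAX_COMMENTS = 5  # hypothetical threshold, not a recommended value


def suspicious_html(body, max_comments=MAX_COMMENTS):
    """Flag a body with an excess of HTML comments or an embedded web form."""
    comment_count = len(HTML_COMMENT.findall(body))
    has_form = FORM_TAG.search(body) is not None
    return comment_count > max_comments or has_form
```

Counting comments rather than trying to strip them keeps the check cheap,
and an occasional legitimate comment stays below the threshold.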
I also flag messages which ONLY contain an HTML attachment, but no
plaintext, as I do plaintext which STARTS with an <HTML opening tag.
If you're concerned about possibly missing messages which are legit,
consider adding a filter which pulls messages to the side based on a
preponderance of terms related to YOUR business - product names,
tradeshows which you attend, etc, then let those coast through.
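A small sketch of that kind of term counting, again in Python as an
assumed helper language. The term list and the minimum score are
placeholders invented for the example; a real list would hold your own
product and tradeshow names.

```python
# Hypothetical terms; replace with your own product and tradeshow names.
BUSINESS_TERMS = ("acme widget", "widgetexpo")


def business_score(text, terms=BUSINESS_TERMS):
    """Count case-insensitive occurrences of the business terms in text."""
    lowered = text.lower()
    return sum(lowered.count(term) for term in terms)


def let_through(text, minimum=2):
    """Treat a message as probably legitimate when enough terms appear."""
    return business_score(text) >= minimum


print(let_through("Saw the Acme Widget at WidgetExpo, please send pricing"))  # True
print(let_through("Buy now"))                                                 # False
```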
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer:
<http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll
get my copy from the list.
_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail