At 14:08 2003-07-05 -0700, Björn Lilja did say:
> I am not worried about the actual ereg for filtering an HTML tag with
> attributes etc., provided there is a function like ereg_replace.
> Basically I could just filter everything within the < > tags, right...?

No. What about legitimately quoted HTML, such as:

    Oh, you should parse for the <IMG> tag

Smileys and other made-up constructs:

    <g>
    <Rod Serling voiceover>
    etc.

or tags which span quoted message comments (quoted lines commonly start
with ">", which would close your HTML),

or math/code:

    # operate only on messages less than 25,000 bytes in size
    :0
    * < 25000

or intentionally bracketed URL references (i.e. the message isn't HTML,
but users often enclose very long URLs in angle brackets so that a smart
email client keeps them intact across line breaks).

> I do not want to change the content of the e-mail, just pre-parse it
> into a variable so I can do more accurate filtering. In, say, Perl or
> PHP this would definitely not be a problem, and I take it that the
> eregs work the same?

If it isn't a problem in Perl, then your best bet is to implement it in
Perl and call your Perl program from procmail. Problem solved, provided
it's really as easy as you think it is. I say it isn't.
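
For what it's worth, here is roughly how the "call your Perl program and
keep the result in a variable" idea could look. This is a sketch, not
something I have tested against your setup: the script path, the STRIPPED
variable and the pattern in the last recipe are all made up for
illustration, and the LINEBUF value is an assumption you would size to
suit the messages you let through.

```procmail
# Sketch only. /usr/local/bin/striphtml.pl is a HYPOTHETICAL script that
# reads a message body on stdin and writes the stripped text to stdout.
# STRIPPED is an invented variable name.

# procmail limits how much text it will handle at once; raise the limit
# so the captured text is not cut short (value is an assumption).
LINEBUF=32768

# b = feed only the body to the program, w = wait for it to finish.
# The output lands in STRIPPED; the message itself is NOT modified.
# The size test keeps large messages away from the script.
:0 bw
* < 25000
STRIPPED=| /usr/local/bin/striphtml.pl

# Later recipes can test the variable instead of the raw body.
# The pattern and the X-Filter-Note header are placeholders.
:0 fhw
* STRIPPED ?? click here to unsubscribe
| formail -A "X-Filter-Note: matched-in-stripped-text"
```

Because the first recipe only assigns a variable, it is non-delivering
and processing carries on with the original message intact.
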
If you simply want to remove HTML constructs, then you'll need to worry
about which messages actually claim to be HTML, and which merely discuss
HTML (such as posts to a technical list). Multipart messages will also
cause you special grief.

> OK, so there is basically no ereg/replace function within procmail
> itself, then?

Procmail has absolutely *NO* replace functionality. Even to change a
header, you must call formail. To delete a line from the body, most people
invoke sed, etc.
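
By way of illustration, the usual shape of those two jobs is a filter
recipe. The header name and the sed pattern below are invented
placeholders, so treat this as a sketch rather than a drop-in:

```procmail
# Change (or add) a header by filtering the header through formail.
# f = treat the pipe as a filter, h = header only, w = wait for it.
# "X-HTML-Flag" is an invented header name.
:0 fhw
| formail -I "X-HTML-Flag: yes"

# Delete matching lines from the body by filtering the body through sed.
# b = body only. The pattern is a placeholder.
:0 fbw
| sed -e '/^This line should go away$/d'
```

Note that both recipes rewrite the message in place, which is exactly
what a pre-parse into a variable is meant to avoid.
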

> [...] that 1) I should be able to receive e-mail from people interested
> in my business or other area even if they are not on my
> nobounce/whitelist (I have one as well) and 2) Many people do
> unfortunately write their e-mails in HTML by default, and the risk that
> someone not on the list sends me a legitimate e-mail like that is just
> too high.

You might simply employ a comment filter: something that tags HTML
messages which contain an excess of HTML comment tags. Additionally, you
could search for characteristic tags which are used in HTML spam but are
generally NOT part of legit communications, web forms for instance.

I also flag messages which contain ONLY an HTML attachment and no
plaintext, as well as plaintext messages which START with an <HTML
opening tag.
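
A rough sketch of how such flags might be written follows. These are not
my actual recipes: the X-HTML-Suspect header, the threshold of 10
comments and the crude tag patterns are assumptions made for the example,
and the Content-Type test only catches messages declared as HTML at the
top level, not multipart messages whose only part is HTML.

```procmail
# Sketch only; header name, threshold and patterns are assumptions.

# More than 10 HTML comment openers in the body. The score starts at
# -10 and gains 1 per "<!--", so the recipe fires on the 11th.
# [<] keeps the leading "<" from being read as a size test.
:0 Bfhw
* -10^0
* 1^1 [<]!--
| formail -A "X-HTML-Suspect: comments"

# A web form in the body (procmail matches case-insensitively).
:0 Bfhw
* [<]form
| formail -A "X-HTML-Suspect: form"

# The message as a whole is declared to be HTML, so there is no
# plaintext part at all.
:0 fhw
* ^Content-Type:.*text/html
| formail -A "X-HTML-Suspect: html-only"

# Not declared as HTML, yet the body opens with an <HTML tag.
:0 fhw
* ! ^Content-Type:.*text/html
* B ?? ^^[<]html
| formail -A "X-HTML-Suspect: html-in-plaintext"
```

Tagging with a header rather than filing straight away lets a later
recipe decide what to do once all the flags are known.
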
If you're concerned about possibly missing messages which are legit,
consider adding a filter which pulls messages to the side based on a
preponderance of terms related to YOUR business (product names, trade
shows which you attend, etc.), then let those coast through.
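
Something along these lines, placed ahead of the HTML checks, would do
it. The terms and the threshold are placeholders I made up; substitute
your own product and trade show names.

```procmail
# Sketch only: every term below is a placeholder.
# H and B together search both the header and the body.
# Each term counts once (the ^0 means repeats add nothing). The -2
# offset means at least 3 different terms must appear before the
# score turns positive and the recipe fires.
:0 HB
* -2^0
* 1^0 acme widget
* 1^0 widgetcon
* 1^0 request for quote
* 1^0 purchase order
$DEFAULT
```

Delivering to $DEFAULT here ends processing for that message, so it
never reaches the recipes that would have flagged it.
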
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.