RE: Something like ereg_replace?

2003-07-05 15:40:21
At 14:08 2003-07-05 -0700, Björn Lilja did say:
> The actual ereg for filtering a html-tag with attributes etc I do not
> worry about if there was a function like ereg_replace. Basically I could
> just filter everything within the < > tags, right...?

No.  What about legitimately quoted HTML, such as:

        Oh, you should parse for the <IMG> tag

Smileys and other made-up constructs:

        <g>
        <Rod Serling voiceover>

        etc.

or tags which span quoted message text (quoted lines commonly start with ">", which would close your HTML tag)

or math/code:

        # operate only on messages less than 25,000 bytes in size
        :0
        * < 25000

or intentionally bracketed URL references: the message isn't HTML, but users often wrap very long URLs in angle brackets so that a smart email client keeps them intact across line breaks.

> I do not want to
> change the content of the e-mail, just pre-parse it into a variable so
> I can do more accurate filtering. In say perl or php this would
> definitely not be a problem and I take it that the eregs work the same?

If it isn't a problem in Perl, then your best bet is to implement it in Perl and call your Perl program from procmail. Problem solved, provided it's really as easy as you think it is. I say it isn't.
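
A minimal sketch of that hand-off, assuming a hypothetical script at $HOME/bin/striphtml.pl which reads the message body on stdin and prints the stripped text on stdout. The script name, the STRIPPED variable, the match phrase and the "suspect" mailbox are all made up for illustration, and the captured text may be truncated by procmail's LINEBUF limit:

        # capture the filtered body in a variable; the message itself
        # is left untouched
        :0 bw
        STRIPPED=| perl $HOME/bin/striphtml.pl

        # later recipes can test the variable instead of the raw body
        :0
        * STRIPPED ?? some phrase worth matching
        suspect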

If you simply want to remove HTML constructs, then you'll need to worry about which messages actually claim to be HTML, and which merely contain HTML by reference (such as posts to a technical list). Multipart messages will also cause you special grief.

> Ok, so there is basically no ereg/replace function within the procmail
> functionality then?

Procmail has absolutely *NO* replace functionality. Even to change a header, you must call formail. To delete a line from the body, most people invoke sed, etc.
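
As a rough illustration of those two idioms (the header name and the sed pattern are invented for the example, not taken from any real setup):

        # replace (or add) a header by filtering the header through formail
        :0 fhw
        | formail -I "X-Filtered: yes"

        # delete matching lines from the body by filtering the body through sed
        :0 fbw
        | sed -e '/^This line should go away$/d'

In both cases procmail only pipes the message through an external program and takes back whatever that program prints; the editing itself happens outside procmail.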

> that 1) I should be able to receive e-mail from people interested in my
> business or other area even if they are not on my nobounce/whitelist (I
> have one as well) and 2) Many people do unfortunately write their
> e-mails in html by default and the risk that someone not on the list
> sends me a legitimate e-mail like that is just too high.

You might simply employ a comment filter - something that tags HTML messages which contain an excess of HTML comment tags. Additionally, you could search for some characteristic tags used in HTML spam, but which generally are NOT part of legit communications - webforms for instance.
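
A sketch of such a comment filter using procmail scoring. The threshold of ten comments and the X-HTML-Flag header are arbitrary assumptions to tune against your own mail. Note the [<] construct: a condition which begins with a bare < is read as a size test, as in the "* < 25000" example earlier.

        # flag the message if the body holds more than ten HTML comments
        :0 B
        * -10^0
        *   1^1 [<]!--
        {
                :0 fhw
                | formail -A "X-HTML-Flag: excess comments"
        }

        # webforms rarely show up in legit mail
        :0 B
        * [<]form
        {
                :0 fhw
                | formail -A "X-HTML-Flag: contains form"
        }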

I also flag messages which contain ONLY an HTML attachment and no plaintext, as well as plaintext messages which START with an <HTML opening tag.
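
One way those two checks might look, as an approximation only: the first recipe treats a top-level Content-Type of text/html as "HTML with no plaintext part", and the second does not allow for blank lines or whitespace ahead of the tag. The X-HTML-Flag header is again a made-up name.

        # top-level content type is HTML, so there is no plaintext part
        :0 fhw
        * ^Content-Type:.*text/html
        | formail -A "X-HTML-Flag: html only"

        # body opens with an <html tag (^^ anchors to the start of the body)
        :0 fhw
        * B ?? ^^[<]html
        | formail -A "X-HTML-Flag: body starts with html"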

If you're concerned about possibly missing messages which are legit, consider adding a filter which pulls messages to the side based on a preponderance of terms related to YOUR business - product names, tradeshows which you attend, etc, then let those coast through.
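
A sketch of that kind of pass-through, placed ahead of the HTML checks. The terms are placeholders for your own product and tradeshow names, and the threshold (at least three mentions across header and body) is an assumption to adjust:

        # deliver normally when the message mentions our business often enough
        :0 HB
        * -2^0
        *  1^1 (acme widget|widgetcon|model 4000)
        $DEFAULT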

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.



_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail

