RE: Something like ereg

At 14:08 2003-07-05 -0700, Björn Lilja did say:

The actual ereg for filtering a html-tag with attributes etc I do not
worry about if there was a function like ereg_replace. Basically I could
just filter everything within the < > tags, right...?


No.  What about legitimately quoted HTML, such as:

        Oh, you should parse for the <IMG> tag

Smileys and other made-up constructs:

        <g>
        <Rod Serling voiceover>

        etc.

or such tags which span quoted message comments (which commonly start
lines with ">", which would close your HTML)


or math/code:

        # operate only on messages less than 25,000 bytes in size
        :0
        * < 25000

or intentionally bracketed URL references (i.e. the message isn't HTML,
but users often quote very long URLs in angle brackets just to keep them
intact across line breaks when you're using a smart email client).
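
To make the point concrete, here is a minimal Python sketch of the "just
filter everything within the < >" idea, run against the kinds of text
listed above. The regex, the function name strip_naive and the example.com
URL are illustrative assumptions, not anything procmail provides.

```python
import re

# Naive approach: delete everything from "<" up to the next ">".
# [^>] also matches newlines, so a match can span several lines.
NAIVE_TAG = re.compile(r"<[^>]*>")

def strip_naive(text):
    """Remove every <...> span, whether or not it is real HTML."""
    return NAIVE_TAG.sub("", text)

samples = [
    "Oh, you should parse for the <IMG> tag",          # quoted HTML
    "<Rod Serling voiceover>",                         # made-up construct
    "* < 25000\n> quoted reply",                       # recipe plus quoting
    "see <http://www.example.com/some/long/path> ok",  # bracketed URL
]

for sample in samples:
    print(repr(sample), "->", repr(strip_naive(sample)))
```

Every sample loses content that was never HTML markup: the tag being
discussed, the smiley, the size condition together with the start of the
quoted line, and the URL.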

 I do not want to
change the content of the e-mail, just pre parse it in to a variable so
I can do more accurate filtering. In say perl or php this would
definitely not be a problem and I take it that the eregs work the same?

If it isn't a problem in Perl, then your best bet is to implement it in
perl and call your perl program from procmail. Problem solved. Provided
it's really as easy as you think it is. I say it isn't.

If you simply want to remove HTML constructs, then you'll need to worry
about which messages actually claim to be HTML, and those which contain
HTML by reference (such as a technical list). Multipart messages will
also pose a special grief to you.
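
One way to separate "claims to be HTML" from "merely mentions HTML" is to
look at the declared MIME types rather than at the text. The sketch below
assumes an external Python helper using the standard email package; the
function names declared_types and claims_html are made up for
illustration.

```python
from email import message_from_string

def declared_types(raw_message):
    """Return the Content-Type of every non-container part of a message."""
    msg = message_from_string(raw_message)
    return [part.get_content_type()
            for part in msg.walk()
            if not part.is_multipart()]

def claims_html(raw_message):
    """True only if some part is explicitly declared as text/html."""
    return "text/html" in declared_types(raw_message)

# A plain-text list post that merely talks about HTML.
plain = "Subject: parsing\n\nOh, you should parse for the <IMG> tag\n"

# A multipart/alternative message carrying both plain and HTML versions.
multipart = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b1"\n'
    "\n"
    "--b1\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--b1\n"
    "Content-Type: text/html\n"
    "\n"
    "<html><body>hello</body></html>\n"
    "--b1--\n"
)

print(claims_html(plain), declared_types(plain))
print(claims_html(multipart), declared_types(multipart))
```

Any tag stripping would then apply only to the parts declared as
text/html, leaving the plain-text parts alone.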

Ok, so there is basically no ereg/replace function within the procmail
functionality then?

Procmail has absolutely *NO* replace functionality. Even to change a
header, you must call formail. To delete a line from the body, most
people invoke sed, etc.
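
Since the editing has to happen in an external program anyway, that
program can be anything which reads text and writes text. As a rough
Python stand-in for the sed approach, assuming a hypothetical "Sent via"
banner line that you want gone from the body:

```python
import re

def delete_matching_lines(body, pattern):
    """Return body with every line matching pattern removed."""
    rx = re.compile(pattern)
    kept = [line for line in body.splitlines(keepends=True)
            if not rx.search(line)]
    return "".join(kept)

# Hypothetical body with an advertising banner appended by a webmail host.
body = "Hello Sean,\nSent via FooMail, get yours free!\nRegards\n"

print(delete_matching_lines(body, r"^Sent via "))
```

A procmail filtering recipe (the f flag, with b to pass only the body and
w to wait for the exit code) could pipe the message through a small
script wrapped around a function like this one.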

that 1) I should be able to receive e-mail from people interested in my
business or other area even if they are not on my nobounce/whitelist (I
have one as well) and 2) Many people do unfortunately write their
e-mails in html by default and the risk that someone not on the list
sends me a legitimate e-mail like that is just too high.

You might simply employ a comment filter - something that tags HTML
messages which contain an excess of HTML comment tags. Additionally, you
could search for some characteristic tags used in HTML spam, but which
generally are NOT part of legit communications - webforms for instance.
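
A rough Python sketch of such a comment filter follows. The threshold of
five comments, the choice of form and input as "webform" tags, and the
name html_spam_signals are all assumptions to be tuned against your own
mail, not established values.

```python
import re

# HTML comments, possibly spanning several lines.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Opening tags that suggest a webform embedded in the message.
FORM_TAG = re.compile(r"<\s*(form|input)\b", re.IGNORECASE)

def html_spam_signals(body, max_comments=5):
    """Report comment count and webform presence for an HTML body."""
    comments = len(COMMENT.findall(body))
    return {
        "comments": comments,
        "excess_comments": comments > max_comments,
        "has_webform": bool(FORM_TAG.search(body)),
    }

# Spam often splits trigger words with empty comments.
spammy = "<html>V<!--a-->i<!--b-->a<!--c-->g<!--d-->r<!--e-->a<!--f--></html>"
legit = "<html><body><!-- generated by mailer --><p>Quote attached.</p></body></html>"

print(html_spam_signals(spammy))
print(html_spam_signals(legit))
```

The result is only a set of flags; what to do with a flagged message
(tag it, file it, score it) stays a delivery decision.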

I also flag messages which contain ONLY an HTML attachment but no
plaintext, as I do plaintext which STARTS with an <HTML opening tag.
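
Both of those checks can be approximated outside procmail. This is a
sketch using Python's standard email package, not my actual recipes; the
function name html_flags and the sample messages are invented for the
example.

```python
from email import message_from_string

def html_flags(raw_message):
    """Return (html_only, disguised_html) for a raw message string."""
    msg = message_from_string(raw_message)
    leaves = [p for p in msg.walk() if not p.is_multipart()]
    types = [p.get_content_type() for p in leaves]

    # HTML part present, but no plain-text part alongside it.
    html_only = "text/html" in types and "text/plain" not in types

    # A part declared as plain text whose body opens with an <HTML tag.
    disguised = False
    for part in leaves:
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload()
        if isinstance(payload, str) and payload.lstrip().lower().startswith("<html"):
            disguised = True

    return html_only, disguised

only_html = "Content-Type: text/html\n\n<html><body>Buy now</body></html>\n"
disguised = "Subject: hi\n\n<HTML><BODY>Buy now</BODY></HTML>\n"
ordinary = "Subject: hi\n\nOh, you should parse for the <IMG> tag\n"

for raw in (only_html, disguised, ordinary):
    print(html_flags(raw))
```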

If you're concerned about possibly missing messages which are legit,
consider adding a filter which pulls messages to the side based on a
preponderance of terms related to YOUR business - product names,
tradeshows which you attend, etc, then let those coast through.
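
As a sketch of that kind of term counting in Python: the term list, the
minimum of two hits and the function names below are placeholders which
you would replace with your own vocabulary and threshold.

```python
import re

def business_term_hits(text, terms):
    """Count how many distinct terms occur in text as whole words."""
    lowered = text.lower()
    hits = 0
    for term in terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, lowered):
            hits += 1
    return hits

def looks_like_business_mail(text, terms, min_hits=2):
    """True when enough business-related terms appear in the message."""
    return business_term_hits(text, terms) >= min_hits

# Hypothetical product and tradeshow names.
TERMS = ["widgetmaster", "acme expo", "price quote"]

inquiry = "Saw your WidgetMaster demo at Acme Expo, can I get a price quote?"
junk = "Cheap meds, click here"

print(looks_like_business_mail(inquiry, TERMS))
print(looks_like_business_mail(junk, TERMS))
```

Messages which pass this test would be delivered before any of the HTML
checks get a chance to divert them.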


---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.



