BTW, I hope it isn't too much to ask, but would you mind *NOT* quoting
entire previous posts when you're not addressing specific elements of
them? Everyone on this list has access to the original messages - we don't
need countless copies of them resent to us at the bottoms of replies tacked
on by MS OutBreak.
At 16:05 2003-07-05 -0700, Björn Lilja did say:
> I do not need to differentiate between correct HTML and made-up
> constructs, sorry for being unclear there. My goal is to first filter
> _enough_ of everything that is not accurate text
What constitutes _accurate_ text to you and to the next guy are obviously
very different things. What happens when someone does:
<<< text written by joe schmoe >>>
for quoting? Most likely, you blitz it. All of it.
I detailed a number of other conditions which will trip up anything that
simply runs through the message and tries to pair up <> markers, notably if
the message IS NOT sent as HTML. Or if it's a plaintext QUOTE of an HTML
doc, etc.
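To make the trap concrete, here's a toy sketch in Python (the sample strings are invented, not from the original mail): a naive stripper that deletes everything between < and > will eat an entire plaintext quotation whole.

```python
import re

def naive_strip(body):
    """Toy HTML 'stripper': delete anything that looks like <...>."""
    return re.sub(r"<[^>]*>", "", body)

# Works as hoped on real markup:
print(naive_strip("he said <b>hi</b>"))                   # he said hi

# But a plaintext quote using angle brackets is blitzed wholesale,
# leaving only the trailing brackets behind:
print(naive_strip("<<< text written by joe schmoe >>>"))  # >>
```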
You should demime the messages as well - quoted-printable and base-64
encoded messages won't scan very well at all. Of course, IME, such
messages are spam anyway, so why scan them, when you can file 'em away on
that simple characteristic unto itself?
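The demiming step is easy enough to sketch with Python's stdlib `email` module (the payload text here is an invented example): a pattern scan over the raw message sees only base64 noise, while the decoded payload exposes the goods.

```python
import base64
from email import message_from_string

payload = b"s<ds>lu<dsfds>ts - cheap stuff inside"
raw = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.encodebytes(payload).decode("ascii")
)

# Scanning the raw message for the tag trick finds nothing...
print("<ds>" in raw)                                  # False

# ...but get_payload(decode=True) undoes the transfer encoding first.
msg = message_from_string(raw)
decoded = msg.get_payload(decode=True).decode("ascii")
print("<ds>" in decoded)                              # True
```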
Oh, and a rising star of spam is to send HTML messages with javascript,
which writes parts of the displayed document via the script.
> analysis from there. That would drastically increase my performance since
> most spam I get uses cheap stuff like s<ds>lu<dsfds>ts.
If they're not PAIRED (open <ds> and close </ds>, even though that isn't a
legit token), then how do you know you're parsing correct token pairs when
removing the supposed html constructs?
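One way to pose that question concretely is a stack walk over the tags - a Python sketch (toy model only; real HTML also has void elements like <br>, which this check ignores):

```python
import re

TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

def tags_paired(body):
    """True if every <tag> is closed by a matching </tag>, in order."""
    stack = []
    for closing, name in TAG.findall(body):
        if closing:
            if not stack or stack.pop() != name.lower():
                return False
        else:
            stack.append(name.lower())
    return not stack

print(tags_paired("s<ds>lu</ds>ts"))    # True  - paired, however bogus
print(tags_paired("s<ds>lu<dsfds>ts"))  # False - nothing ever closes
```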
If you have "cheap stuff" like that, why not see if you can't weight it with
scoring? Perhaps simply count the sheer number of HTML tag markers:
:0
* 1^1 [^<]<
* 1^1 [^>]>
That, I think, should keep you from flagging instances where the symbols
start the line, so multiple depths of quotes shouldn't pose a problem, cept
when some hoser does:
>>:>>> bleh
BL> bleh
and the like, which admittedly is increasingly common.
Of course, if it's a legit HTML document, you'll have EXACTLY the same
number of < as you do >, because they should all be paired, and the text
symbols themselves should be escaped: &lt; &gt;. However, you're stuck
with needing to split the HTML portion from the TEXT and other
portions. Then, there's the BASE64 factor to consider.
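That balance is trivially checkable - a Python sketch with made-up sample text:

```python
html = 'see <a href="http://example.invalid/">this</a> bargain: 1 &lt; 2'
plain = ">> he wrote:\n>>> 3 > 2"

# Legit HTML: brackets come in pairs, literal ones appear as entities.
print(html.count("<"), html.count(">"))    # 2 2

# Plaintext with quote markers: wildly lopsided counts.
print(plain.count("<"), plain.count(">"))  # 0 6
```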
> Naturally, I would also then have to filter the message without removing
> the tags, and then I would catch the rare cases when someone writes <porn>
> like that and it is actually readable in the message.
Naturally, this means processing the message twice (or more), screening for
the same things.
> And again, the stuff I actually read should be unaltered - the replace
> function would only be to put a temporary version of the message into a
> variable for the spam filter to work on.
Which is why you'd pipe it to an external program, for example:
SCRUBBED=|someperlscript.pl
> Or am I missing the entire point of yours?
Just that you probably don't need to go to these extents - there are ample
indicators for false HTML (and the majority of spam is identifiable from
the headers alone, which process a LOT faster than a potentially large body
will).
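The headers-only point is easy to see in a sketch (Python's stdlib parser; the sample message is invented): a header-only parse stops at the blank line and never has to touch a megabyte of body.

```python
from email.parser import BytesHeaderParser

raw = (
    b"From: spammer@example.invalid\r\n"
    b"Subject: CHEAP STUFF\r\n"
    b"\r\n"
) + b"x" * 1_000_000  # a megabyte of body we never want to scan

# BytesHeaderParser parses headers only; the body goes unexamined.
headers = BytesHeaderParser().parsebytes(raw)
print(headers["Subject"])   # CHEAP STUFF
```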
At some point, you might come to realize that the added CPU overhead of
churning every which way on a message to extract body elements is far too
high for the handful of messages it will ever tag.
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.
_______________________________________________
procmail mailing list
procmail@lists.RWTH-Aachen.DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail