
RE: Something like ereg_replace?

2003-07-05 17:14:11

BTW, I hope it isn't too much to ask, but would you mind *NOT* quoting entire previous posts when you're not addressing specific elements of them? Everyone on this list has access to the original messages - we don't need countless copies of them resent to us at the bottom of replies, tacked on by MS OutBreak.

At 16:05 2003-07-05 -0700, Björn Lilja did say:

I do not need to differentiate between correct HTML and made up
constructs, sorry for being unclear there. My goal is to first filter
_enough_ of everything that is not accurate text

What constitutes _accurate_ text to you and what constitutes it to the next guy are obviously two very different things. What happens when someone does:

<<< text written by joe schmoe >>>

For quoting?  Most likely, you blitz it.  All of it.

I detailed a number of other conditions which will trip up anything that simply runs through the message and tries to pair up <> markers, notably if the message IS NOT sent as HTML. Or if it's a plaintext QUOTE of an HTML doc, etc.

You should demime the messages as well - quoted-printable and base-64 encoded messages won't scan very well at all. Of course, IME, such messages are spam anyway, so why scan them when you can file 'em away on that simple characteristic alone?
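
For instance - strictly a sketch, assuming you never receive legitimate mail whose entire body is base-64 encoded, and with "probably-spam" standing in for whatever folder you'd actually use - the header check is dirt cheap:

# placeholder folder; this only catches single-part messages encoded
# wholesale, since multipart messages declare the encoding in the part
# headers rather than the top header
:0
* ^Content-Transfer-Encoding:.*base64
probably-spam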

Oh, and a rising star among spam tricks is sending HTML messages with javascript that writes parts of the displayed document at viewing time.
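
If you wanted to tag those, a plain body grep would do it - again just a sketch, the folder name being a placeholder (the B flag makes the condition run against the body rather than the header):

# placeholder folder
:0 B
* (<script|javascript:)
probably-spam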

analysis from there. That would drastically increase my performance since
most spam I get uses cheap stuff like s<ds>lu<dsfds>ts.

If they're not PAIRED (open <ds> and close </ds>, even though that isn't a legit token), then how do you know you're parsing correct token pairs when removing the supposed html constructs?

If you have "cheap stuff", why not see if you can't weight it with scoring? Perhaps simply count the sheer number of those HTML tag markers:

# B scores against the body; "probably-spam" is just a placeholder folder
:0 B
* 1^1 [^<]<
* 1^1 [^>]>
probably-spam

That, I think, should keep you from flagging instances where the symbols start the line, so multiple depths of quotes shouldn't pose a problem, cept when some hoser does:

>>:>>>  bleh
BL> bleh

and the like, which admittedly is increasingly common.
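
And if a single stray bracket shouldn't be enough to trip the recipe, the usual trick is to seed the score with a negative offset so it takes a couple dozen markers before the total goes positive - a sketch again, with both the 20 and the folder name being values you'd pick yourself:

# placeholder threshold and folder
:0 B
* -20^0
*   1^1 [^<]<
*   1^1 [^>]>
probably-spam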

Of course, if it's a legit HTML document, you'll have EXACTLY the same number of < as you do >, because they should all be paired, and the text symbols themselves should be escaped as &lt; and &gt;. However, you're stuck with needing to split the HTML portion from the TEXT and other portions. Then there's the BASE64 factor to consider.
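
As a crude illustration of that balance (a sketch, not something to deploy as-is): score each < up and each > down, and the recipe only fires when the body has more < than > - you'd need the mirror-image recipe to catch the opposite imbalance, and the folder is once more a placeholder:

# placeholder folder; fires only when the body has more < than >
:0 B
*  1^1 <
* -1^1 >
probably-not-html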

Naturally, I would also then have to filter the message without removing the tags, and then I would find the rare cases when someone writes <porn> like
that and it is actually readable in the message.

Naturally, this means processing the message twice (or more), screening for the same things.

And again, the stuff I actually read should be unaltered - the replace
function would only be to put a temporary version of the message into a
variable for the spam filter to work on.

Which is why you'd pipe it to an external program from within a recipe - the b flag hands the script just the body, and its stdout lands in the variable - for example:

:0 b
SCRUBBED=|someperlscript.pl
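
Later recipes can then match against the captured text instead of the raw body. A sketch only - the SCRUBBED ?? form matches the variable's contents, and the pattern and folder here are nothing but placeholders:

# placeholder pattern and folder - substitute your own
:0
* SCRUBBED ?? (viagra|mortgage)
probably-spam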

Or am I missing the entire point of yours?

Just that you probably don't need to go to these lengths - there are ample indicators of false HTML (and the majority of spam is identifiable from the headers alone, which process a LOT faster than a potentially large body will).
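
One example of the sort of header-only check I mean (a sketch; the folder is a placeholder): a message that declares an HTML body yet carries no MIME-Version: header at all is a strong hint of spamware output, and you never have to read the body to spot it:

# placeholder folder
:0
* ^Content-Type:.*text/html
* ! ^MIME-Version:
probably-spam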

At some point, you might come to realize that the added CPU overhead of churning every which way through a message to extract body elements is far too high for the handful of messages it will ever tag.

---
 Sean B. Straw / Professional Software Engineering

 Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.




