BTW, I hope it isn't too much to ask, but would you mind *NOT* quoting
entire previous posts when you're not addressing specific elements of
them? Everyone on this list has access to the original messages - we don't
need countless copies of them resent to us at the bottoms of replies tacked
on by MS OutBreak.
At 16:05 2003-07-05 -0700, Björn Lilja did say:
> I do not need to differentiate between correct HTML and made-up
> constructs, sorry for being unclear there. My goal is to first filter
> _enough_ of everything that is not accurate text
What constitutes _accurate_ text to you and to the next guy are obviously
very different things. What happens when someone does:
<<< text written by joe schmoe >>>
for quoting? Most likely, you blitz it. All of it.
I detailed a number of other conditions which will trip up anything that
simply runs through the message and tries to pair up <> markers, notably if
the message IS NOT sent as HTML. Or if it's a plaintext QUOTE of an HTML
doc, etc.
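To make the trap concrete, here's a toy sketch in Python (the sample strings are invented, not from the original mail): a naive stripper that deletes everything between < and > will eat an entire plaintext quotation whole.

```python
import re

def naive_strip(body):
    """Toy HTML 'stripper': delete anything that looks like <...>."""
    return re.sub(r"<[^>]*>", "", body)

# Works as hoped on real markup:
print(naive_strip("he said <b>hi</b>"))                   # he said hi

# But a plaintext quote using angle brackets is blitzed wholesale,
# leaving only the trailing brackets behind:
print(naive_strip("<<< text written by joe schmoe >>>"))  # >>
```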
You should demime the messages as well - quoted-printable and base-64
encoded messages won't scan very well at all. Of course, IME, such
messages are spam anyway, so why scan them, when you can file 'em away on
that simple characteristic unto itself?
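The demiming step is easy enough to sketch with Python's stdlib `email` module (the payload text here is an invented example): a pattern scan over the raw message sees only base64 noise, while the decoded payload exposes the goods.

```python
import base64
from email import message_from_string

payload = b"s<ds>lu<dsfds>ts - cheap stuff inside"
raw = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + base64.encodebytes(payload).decode("ascii")
)

# Scanning the raw message for the tag trick finds nothing...
print("<ds>" in raw)                                  # False

# ...but get_payload(decode=True) undoes the transfer encoding first.
msg = message_from_string(raw)
decoded = msg.get_payload(decode=True).decode("ascii")
print("<ds>" in decoded)                              # True
```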
Oh, and a rising star of spam is to send HTML messages with javascript,
which writes parts of the displayed document via the script.
> analysis from there. That would drastically increase my performance since
> most spam I get uses cheap stuff like s<ds>lu<dsfds>ts.
If they're not PAIRED (open <ds> and close </ds>, even though that isn't a
legit token), then how do you know you're parsing correct token pairs when
removing the supposed html constructs?
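One way to pose that question concretely is a stack walk over the tags - a Python sketch (toy model only; real HTML also has void elements like <br>, which this check ignores):

```python
import re

TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

def tags_paired(body):
    """True if every <tag> is closed by a matching </tag>, in order."""
    stack = []
    for closing, name in TAG.findall(body):
        if closing:
            if not stack or stack.pop() != name.lower():
                return False
        else:
            stack.append(name.lower())
    return not stack

print(tags_paired("s<ds>lu</ds>ts"))    # True  - paired, however bogus
print(tags_paired("s<ds>lu<dsfds>ts"))  # False - nothing ever closes
```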
If you have "cheap stuff" like that, why not see if you can't weight it with
scoring? Perhaps simply count the sheer number of HTML tag markers:
:0
* 1^1 [^<]<
* 1^1 [^>]>
That, I think, should keep you from flagging instances where the symbols
start the line, so multiple depths of quotes shouldn't pose a problem, cept
when some hoser does:
>>:>>> bleh
BL> bleh
and the like, which admittedly is increasingly common.
Of course, if it's a legit HTML document, you'll have EXACTLY the same
number of < as you do >, because they should all be paired, and the text
symbols themselves should be escaped: &lt; &gt;. However, you're stuck
with needing to split the HTML portion from the TEXT and other
portions. Then, there's the BASE64 factor to consider.
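That balance is trivially checkable - a Python sketch with made-up sample text:

```python
html = 'see <a href="http://example.invalid/">this</a> bargain: 1 &lt; 2'
plain = ">> he wrote:\n>>> 3 > 2"

# Legit HTML: brackets come in pairs, literal ones appear as entities.
print(html.count("<"), html.count(">"))    # 2 2

# Plaintext with quote markers: wildly lopsided counts.
print(plain.count("<"), plain.count(">"))  # 0 6
```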
> Naturally, I would also then have to filter the message without removing
> the tags, and then I would catch the rare cases when someone writes <porn>
> like that and it is actually readable in the message.
Naturally, this means processing the message twice (or more), screening for
the same things.
> And again, the stuff I actually read should be unaltered - the replace
> function would only be to put a temporary version of the message into a
> variable for the spam filter to work on.
Which is why you'd pipe it to an external program, for example:
SCRUBBED=|someperlscript.pl
> Or am I missing the entire point of yours?
Just that you probably don't need to go to these extents - there are ample
indicators for false HTML (and the majority of spam is identifiable from
the headers alone, which process a LOT faster than a potentially large body
will).
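The headers-only point is easy to see in a sketch (Python's stdlib parser; the sample message is invented): a header-only parse stops at the blank line and never has to touch a megabyte of body.

```python
from email.parser import BytesHeaderParser

raw = (
    b"From: spammer@example.invalid\r\n"
    b"Subject: CHEAP STUFF\r\n"
    b"\r\n"
) + b"x" * 1_000_000  # a megabyte of body we never want to scan

# BytesHeaderParser parses headers only; the body goes unexamined.
headers = BytesHeaderParser().parsebytes(raw)
print(headers["Subject"])   # CHEAP STUFF
```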
At some point, you might come to realize that the added CPU overhead of
churning every which way on a message to extract body elements is far too
high for the handful of messages it will ever tag.
---
Sean B. Straw / Professional Software Engineering
Procmail disclaimer: <http://www.professional.org/procmail/disclaimer.html>
Please DO NOT carbon me on list replies. I'll get my copy from the list.
_______________________________________________
procmail mailing list
procmail@lists.RWTH-Aachen.DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail