ietf-asrg

RE: [Asrg] "more readable"

2003-06-23 12:17:41
From: "Hallam-Baker, Phillip" <pbaker(_at_)verisign(_dot_)com>

It is not that difficult to strip out HTML tags with an FSR of about
10 states.
...

For now, I think a simplistic parser with fewer than 10 states related to
HTML is sufficient, but I anticipate needing a more complicated mechanism.
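
As a rough illustration of how small such a parser can be, here is a
minimal Python sketch of a tag stripper with five states.  It is not
anyone's actual filter: the names State and strip_tags are made up for
this example, and it uses startswith() lookahead for the comment
delimiters where a strict state machine would spend a few more states.
It also assumes a comment ends only at a literal "-->".

```python
from enum import Enum, auto


class State(Enum):
    TEXT = auto()     # ordinary character data
    TAG = auto()      # inside <...>, outside any quoted attribute value
    DQUOTE = auto()   # inside a "double-quoted" attribute value
    SQUOTE = auto()   # inside a 'single-quoted' attribute value
    COMMENT = auto()  # inside <!-- ... -->


def strip_tags(html: str) -> str:
    """Return html with tags and comments removed (hypothetical helper)."""
    out = []
    state = State.TEXT
    i = 0
    while i < len(html):
        ch = html[i]
        if state is State.TEXT:
            if html.startswith("<!--", i):
                state = State.COMMENT
                i += 4
                continue
            if ch == "<":
                state = State.TAG
            else:
                out.append(ch)
        elif state is State.TAG:
            if ch == ">":
                state = State.TEXT
            elif ch == '"':
                state = State.DQUOTE
            elif ch == "'":
                state = State.SQUOTE
        elif state is State.DQUOTE:
            if ch == '"':
                state = State.TAG
        elif state is State.SQUOTE:
            if ch == "'":
                state = State.TAG
        elif state is State.COMMENT:
            if html.startswith("-->", i):
                state = State.TEXT
                i += 3
                continue
        i += 1
    return "".join(out)
```

For example, strip_tags('com<!-- stupid stuff -->ments') returns
'comments', and a ">" inside a quoted attribute value does not end
the tag.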

It's not merely that you need somewhat more than 10 states to detect
elements (as opposed to tags) in order to know, for example, where
font and color settings end.  Netscape 7 and Internet Explorer do
various visible things that depend on parsing that can only be done
with some backtracking.  A trivial example is that both consider
<!--this--> to be a comment but not <!--not-this-- >, contrary to
http://www.w3.org/TR/html401/intro/sgmltut.html#idx-HTML
Netscape 7 and Internet Explorer consider <!--stuff<and><nonsense-- >
as two tags instead of one comment.
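
The difference can be sketched with two regular expressions in Python.
This is only an approximation under stated assumptions: SPEC_COMMENT
allows white space between "--" and ">" as the HTML 4.01 text cited
above permits, STRICT_COMMENT requires a literal "-->" as a stand-in
for the browser behavior described, and neither pattern handles full
SGML comment syntax.  All names here are hypothetical.

```python
import re

# White space allowed between the closing "--" and ">" (per HTML 4.01).
SPEC_COMMENT = re.compile(r"<!--.*?--\s*>", re.DOTALL)

# Only a literal "-->" closes the comment (approximates the browsers).
STRICT_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def is_comment(pattern: re.Pattern, text: str) -> bool:
    """True if the whole of text is one comment under the given pattern."""
    return pattern.fullmatch(text) is not None


samples = ["<!--this-->", "<!--not-this-- >", "<!--stuff<and><nonsense-- >"]
for s in samples:
    print(s, is_comment(SPEC_COMMENT, s), is_comment(STRICT_COMMENT, s))
```

Under these patterns all three samples are comments by the
specification's rule, but only the first is one under the strict rule.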


Strange tags and com<!-- stupid stuff -->ments can be elided, but they
are a very reliable spam indicator. If an email has an HTML comment in
the middle of a spam word, that increases the probability that it is spam.
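
One way to turn that observation into a filter feature, sketched in
Python: count comments that have a letter immediately on both sides.
The pattern and the function name count_split_word_comments are
assumptions for illustration, not taken from any particular filter.

```python
import re

# A comment with a letter directly before "<!--" and directly after "-->".
SPLIT_WORD_COMMENT = re.compile(
    r"(?<=[A-Za-z])<!--.*?-->(?=[A-Za-z])", re.DOTALL
)


def count_split_word_comments(html: str) -> int:
    """Count HTML comments embedded in the middle of a word."""
    return len(SPLIT_WORD_COMMENT.findall(html))
```

A comment set off by spaces, as in 'word <!-- ok --> word', is not
counted, so ordinary commented HTML does not trigger the feature.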

This is one case of a spam sender countermeasure that backfires.

That popular filters have been adjusted in that direction may be why
the more advanced spammers have stopped using that particular tactic.


Vernon Schryver    vjs(_at_)rhyolite(_dot_)com

_______________________________________________
Asrg mailing list
Asrg(_at_)ietf(_dot_)org
https://www1.ietf.org/mailman/listinfo/asrg


