Professional Software Engineering wrote:
> At 12:42 2004-01-21 -0500, Tom Limoncelli wrote:
>> Occasionally I get HTML email without a multipart/alternative that is
>> plaintext.
>
> So do I. I call it spam. <g>
>
>> Does anyone have a script that would turn the html into plaintext
>> and add it as a multipart/alternative?
>
> Autoreply to the sender and tell them to fix their email client?
>
>> I found this:
>> http://www.xent.com/pipermail/fork/2002-June/012749.html
>> but it strips the HTML out completely.
>
> Convert the HTML to plaintext (the above tool might do that, I didn't
> follow the link -- or lynx can), then take the original text and the
> HTML (and whatever ELSE might have been in there), and feed it into
> "mimencode" to generate a new message.
>
> One must wonder: if you WANT these messages, and the absence of a
> plaintext part is a problem, why not upgrade to an MUA that can handle
> it -- OR, if your MUA can't, why is just rewriting it to plaintext a
> problem?
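
For what it's worth, the conversion described above (HTML to plaintext,
then both parts repackaged as multipart/alternative) can be roughed out
in a few lines. This is only a sketch under assumptions: it swaps in
Python's standard email and html.parser modules for the lynx and
mimencode steps, it only handles a message whose sole body is text/html,
and the names html_to_text and add_plaintext_alternative are made up for
illustration.

```python
import email
from email import policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("br", "p", "div", "tr", "li"):
            # Treat common block-level tags as line breaks.
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace and drop empty lines.
        lines = (" ".join(l.split()) for l in "".join(self._chunks).splitlines())
        return "\n".join(l for l in lines if l) + "\n"


def html_to_text(html):
    """Very rough HTML-to-text conversion (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def add_plaintext_alternative(raw_bytes):
    """Return the message with a text/plain alternative added.

    Messages that are already multipart, or whose body is not
    text/html, are returned unchanged.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    if msg.is_multipart() or msg.get_content_type() != "text/html":
        return raw_bytes
    html = msg.get_content()
    msg.set_content(html_to_text(html))        # body becomes text/plain
    msg.add_alternative(html, subtype="html")  # wraps both parts in multipart/alternative
    return msg.as_bytes()
```

Something like this could be called from a small filter script that reads
the message on stdin and writes the result to stdout. Attachments and
odd charsets are not handled here.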
Another species of spam is the kind where the HTML differs fundamentally
from the plaintext. A useful spam detection technique would be to
compute a checksum of the vowels in the plaintext and in the
tag-stripped HTML; if these two numbers differ, consider it spam.
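
A minimal sketch of that vowel-checksum comparison, assuming CRC-32 as
the checksum and a crude regex for tag stripping (both are my choices
for illustration, as are the function names):

```python
import re
import zlib
from html import unescape

VOWELS = frozenset("aeiou")


def vowel_checksum(text):
    """CRC-32 over just the ASCII vowels of the text, in order."""
    vowels = "".join(ch for ch in text.lower() if ch in VOWELS)
    return zlib.crc32(vowels.encode("ascii"))


def strip_tags(html):
    """Crude tag stripper; does not special-case <script> or <style>."""
    return unescape(re.sub(r"<[^>]*>", " ", html))


def parts_disagree(plain, html):
    """True when the plaintext and the tag-stripped HTML differ in vowels."""
    return vowel_checksum(plain) != vowel_checksum(strip_tags(html))
```

Looking only at vowels makes the comparison insensitive to whitespace,
punctuation, and line wrapping. Legitimate mail whose HTML part carries
extra text (link labels, footers) would still trip it, so it probably
works better as one score among several than as a hard rule.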
Also, to catch these blocks of random words, calculate the average word
length of the message and see if it's far in excess of the normal (4-5
characters per word). But only do this when the word count exceeds a
certain number, so as to avoid false positives.
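
The word-length test might look roughly like this. The thresholds
(min_words=50, max_avg=7.0) are guesses for illustration, not tuned
values, and looks_like_word_salad is a made-up name:

```python
import re


def average_word_length(text):
    """Return (average word length, word count) for runs of ASCII letters."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0, 0
    return sum(len(w) for w in words) / len(words), len(words)


def looks_like_word_salad(text, min_words=50, max_avg=7.0):
    """Flag text whose average word length is well above the usual 4-5.

    Short messages (fewer than min_words words) are never flagged,
    to avoid false positives.
    """
    avg, count = average_word_length(text)
    return count >= min_words and avg > max_avg
```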
--gus
http://spies.com/gus/
_______________________________________________
procmail mailing list
procmail(_at_)lists(_dot_)RWTH-Aachen(_dot_)DE
http://MailMan.RWTH-Aachen.DE/mailman/listinfo/procmail