procmail

when plaintext differs from html

2004-01-21 15:57:51
Professional Software Engineering wrote:
At 12:42 2004-01-21 -0500, Tom Limoncelli wrote:

Occasionally I get HTML email that has no plaintext
multipart/alternative part.


So do I.  I call it spam. <g>

Does anyone have a script that would turn the html into plaintext
and add it as a multipart/alternative?


Autoreply to the sender and tell them to fix their email client?

I found this:
http://www.xent.com/pipermail/fork/2002-June/012749.html
but it strips the HTML out completely.


Convert the HTML to plaintext (the above tool might do that, I didn't follow the link; lynx can, too). Then take that text and the original HTML (and whatever ELSE might have been in there), and build a new multipart/alternative message from them, using "mimencode" for any parts that need base64 or quoted-printable encoding.
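
For anyone who wants to try this, here is a minimal sketch of the idea in Python, using only the standard library rather than lynx and mimencode. The function names (html_to_text, add_plaintext_alternative) are made up for illustration, and the HTML-to-text conversion is deliberately crude. It assumes the message is a single text/html part and leaves anything else untouched.

```python
from email import message_from_string, policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("br", "p", "div", "tr", "li"):
            # Treat common block-level tags as line breaks.
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(html):
    """Very rough HTML-to-plaintext conversion (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line) + "\n"


def add_plaintext_alternative(raw_message):
    """Return the message with a text/plain part added before the HTML.

    Messages that are not a bare text/html part are returned unchanged.
    """
    msg = message_from_string(raw_message, policy=policy.default)
    if msg.get_content_type() != "text/html":
        return raw_message
    html = msg.get_content()
    # Replace the body with plaintext, then attach the original HTML as
    # the alternative; non-Content-* headers are kept as they were.
    msg.set_content(html_to_text(html))
    msg.add_alternative(html, subtype="html")
    return msg.as_string()
```

Something like this could be run as a filter from a procmail recipe, reading the message on stdin and writing the result to stdout. That wiring is not shown here.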

One must wonder: if you WANT these messages, and the absence of a plaintext part is a problem, why not upgrade to an MUA that can handle it -- OR, if your MUA can't, why is just rewriting it to plaintext a problem?



Another species of spam is the kind where the HTML differs fundamentally from the plaintext. A useful spam detection technique would be to compute a checksum of the vowels in the plaintext and in the tag-stripped HTML; if the two numbers differ, consider it spam.
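
A rough sketch of that vowel-checksum comparison in Python. The choice of CRC32 as the checksum, the vowel set (a, e, i, o, u, case-insensitive), and the function names are all assumptions made for illustration. The tag stripper is naive: it keeps the text inside script and style elements, so real mail would need something more careful.

```python
import zlib
from html.parser import HTMLParser

VOWELS = set("aeiou")


class _TagStripper(HTMLParser):
    """Keep only the text between tags (entities are decoded by default)."""

    def __init__(self):
        super().__init__()
        self.text = []

    def handle_data(self, data):
        self.text.append(data)


def strip_tags(html):
    stripper = _TagStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.text)


def vowel_checksum(text):
    """CRC32 over the sequence of vowels in the text, ignoring case."""
    vowels = "".join(ch for ch in text.lower() if ch in VOWELS)
    return zlib.crc32(vowels.encode("ascii"))


def parts_disagree(plain, html):
    """True when the plaintext and tag-stripped HTML checksums differ."""
    return vowel_checksum(plain) != vowel_checksum(strip_tags(html))
```

Looking only at vowels makes the comparison indifferent to whitespace, punctuation, and line wrapping, which tend to differ between the two parts even in legitimate mail.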

Also, to catch these blocks of random words, calculate the average word length of the message and see whether it is far in excess of the normal 4-5 characters per word. Only do this when the word count exceeds a certain number, so as to avoid false positives.
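
The word-length check could look something like the Python sketch below. The thresholds (at least 50 words, average above 7 characters) are placeholders picked for illustration, not tuned values, and the definition of a "word" as a run of ASCII letters and apostrophes is likewise an assumption.

```python
import re

# A "word" here is a run of ASCII letters and apostrophes (an assumption).
WORD_RE = re.compile(r"[A-Za-z']+")


def average_word_length(text):
    """Mean length of the words in the text, or 0.0 if there are none."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def looks_like_word_salad(text, min_words=50, max_average=7.0):
    """Flag long messages whose average word length is unusually high.

    Short messages are never flagged, to avoid false positives.
    """
    words = WORD_RE.findall(text)
    if len(words) < min_words:
        return False
    return sum(len(w) for w in words) / len(words) > max_average
```

Both thresholds would need to be checked against real mail before relying on them.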

--gus
http://spies.com/gus/


