procmail

Re: matching words that are laced with html

2003-10-30 05:17:02
On Thu, Oct 30, 2003 at 09:27:56AM +0000, Ian Prideaux wrote:

> Shea Gray wrote:
>
> > I wanted to find out how you guys go about matching words that have
> > html comments in them like the one below....
> >
> > Viag<!-- necessity -->ra
> >
> > I'm not sure how I should go about writing the regexp for this,
> > so some of your examples or suggestions would be nice. Thanks in
> > advance.
>
> I guess that it would be something like:
>
> V<!--.*-->i<!--.*-->a<!--.*-->g<!--.*-->r<!--.*-->a
>
> but IIRC there's something odd about the way procmail interprets
> <> signs, you need to enclose them in [] brackets to stop procmail
> interpreting them as special characters. Do you need to escape the !
> as well?

No.  Don't start any ugly rumors, Ian.  Procmail interprets < and >
just fine as literals in condition lines.  One does need to make sure
the char is escaped when it is the first char of a condition, though,
because otherwise procmail assumes one is using the special operator
to test message size.

Here are two valid examples of < or > appearing on condition lines:

   :0
   * To:.*<foobar(_at_)example(_dot_)com>
   ! forward(_at_)somewhere(_dot_)tld


   :0 fw
   * < 250000
   | spamc -d spamd


And here is "<" quoted because procmail would otherwise think it's an
operator, as it is in the last recipe above:

  FROM = `formail -zx From:`

  :0:
  * FROM ?? ()<
  $DEFAULT



The original poster wants to reconstruct words in message bodies that
are broken up by HTML comments.  I have this to say about that, and
I've said similarly before:  First, the algorithm would be a bloody
mess.  But second, and, one might say, epistemologically speaking, the
heuristic is bass-ackwards and nearly pointless.

Let me explain, as I've done before, by way of an analogy:  You hear
noises downstairs in the middle of the night.  You suspect a burglar
or a trespasser has broken into your house.  You grab a baseball bat
and run downstairs.  The "burglar," who was actually a vandal, is gone,
but your living room has been graffiti-tagged.  The paint is all over,
including across the face of your signed-original, wall-size Andy Warhol
print and your prize, imitation Venus de Milo statue.  But wait, here's
the kicker:  There are apparent messages sprayed everywhere with the
paint!  However, they are in a foreign language, and you don't know
what they say.  Now you stand there in your pajamas, your baseball bat
hanging limp at your side, and you ask yourself an important (NOT!)
question:  "Maybe I should call up my cousin Emilio, who speaks
Armenian,[1] so I can ask him what this stuff says, IN ORDER TO DECIDE
IF A CRIMINAL ACT HAS TRANSPIRED HERE AND I SHOULD THEREFORE CALL THE
POLICE?"

My point, which should be more than obvious, is, don't be an idiot.
Think it through.  You don't need to put yourself, your .procmailrc, or
your server through high-wire circus-act hoops to know that a piece of
email of HTML-only MIME-type containing dozens of HTML comments, with
many of them often appearing on a single line, IS NOT YOUR AUNT MARTHA
TRYING TO WRITE YOU THAT HER EMAIL ADDRESS HAS CHANGED!!!!  The message
is spam, pure and simple; and we don't need, programmatically, to call
up the equivalent of Cousin Emilio to interpret the strange messages in
order to know that this message is spam.
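
As a rough sketch of that simpler heuristic (not a tested recipe): flag
mail whose top-level type is text/html and whose body holds many HTML
comment openers, without decoding any of them.  The threshold of ten and
the folder name "probably-spam" are assumptions picked for illustration,
and the body is assumed not to be base64 or quoted-printable encoded.

```procmail
# Only look at messages whose top-level Content-Type is text/html.
:0
* ^Content-Type:.*text/html
{
  # Score the body: start at -9 and add 1 for every "<!--" found,
  # so the recipe fires only on ten or more comment openers.
  # The leading () keeps procmail from reading "<" as the size operator.
  :0 B:
  * -9^0
  * 1^1 ()<!--
  probably-spam
}
```

Nothing here tries to read what the comments are hiding; their sheer
number in an HTML-only message is the whole test.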

The code you write is an algorithm.  It represents a How-To for
your heuristic.  Your heuristic is what drives the big picture.
It stands for the question, "What am I trying to do, and Why?"
The algorithm (code) is the How.  But your What and your Why
should make good logical sense, first!

As I implied up above, it is complicated to decode the HTML comments
in procmail.  It can be done, in a high-wire-circus-act kind of
a way.  I would only ask one question, though: Why?
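
For a sense of what the circus act involves, here is a minimal sketch
for one single word.  The variable name COMMENT and the folder name are
hypothetical, and the pattern assumes each comment sits on one line and
contains no ">" of its own.

```procmail
# Hypothetical helper: zero or more HTML comments in a row.
COMMENT = "(<!--[^>]*-->)*"

# The $ flag expands ${COMMENT} before the condition is parsed as a regexp.
# Because the helper allows zero comments, the plain word matches as well.
:0 B:
* $ V${COMMENT}i${COMMENT}a${COMMENT}g${COMMENT}r${COMMENT}a
probably-spam
```

That covers one spelling of one word and nothing else, which is rather
the point of the question.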



[1] Look, I had to pick something.  I chose Armenian, but I might as
well have said Martian.  The choice has nothing to do with any implied
insult against Armenians as taggers.  I just needed a language few who
are casually present are likely to be fluent in.  Okay?

-- 
dman
