Re: debugging \<(xxx|yyy|zzz)\>

On Fri, May 28, 2004 at 09:07:30PM -0700, Jim Osborn wrote:

> When the spammer is in control of the string, then, there's no way to
> count "wild teens" as two words with, say, (\<wild\>|\<teens?\>)
> without risking counting "wilderness in thirteen" as two bad words
> if I remove the word delimiters.  Hmmmmm....
>
> If you didn't find a solution, I'm sure not likely to! :)


Hi, Jim,

As for not finding a solution, I think (though I only skimmed
the previous thread, so I hope I am restating this correctly)
that what David didn't find was merely an efficient way to
write a regex that overlaps words.  I don't think one exists;
I have looked at that animal before too.  Please note the
qualifying adjective, "efficient", by which we mean here
_without having to repeat the overlapped word in the regex_.
I think David was talking about a point of arcane expression,
not something that stops you from writing a recipe to do
exactly what you want.

IOW, I am sure David has no problem constructing a recipe
algorithm to do what you have in mind.  It is eminently doable
to write, in procmail condition form, a rule that we can
imagine in our heads.  :-)

You want, I think, "wild" plus one or more whitespace OR
newline OR whitespace+newline, plus "teen"; with the two
words bounded by procmail word delimiters.  Right?

Assuming you have a variable, which we'll call "$WS", defined
as a space and a tab:

 :0 B:
 * $  9876543210^0  ()\<wild[$WS]+teens\>
 * $  9876543210^0  ()\<wild[$WS]*$+[$WS]*teens\>
 wildteens

should be it.

(Not tested.)
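
For a quick sanity check of the logic outside procmail, here is a
rough Python sketch of the same two alternatives.  It is only an
approximation, not a procmail test: Python's \b is zero-width,
whereas procmail's \< and \> each match a character, and the names
WS, PATTERN and is_hit are made up for this illustration.

```python
import re

# Stand-in for the procmail $WS variable: a space and a tab.
WS = " \t"

# "wild", then either spaces/tabs only, or optional spaces/tabs around
# one or more newlines, then "teens"; both ends bounded as whole words.
# \b only approximates procmail's \< and \> (which consume a character).
PATTERN = re.compile(rf"\bwild(?:[{WS}]+|[{WS}]*\n+[{WS}]*)teens\b")


def is_hit(body: str) -> bool:
    """Return True if the body contains the two-word phrase."""
    return PATTERN.search(body) is not None


print(is_hit("wild teens"))               # True
print(is_hit("wild \n teens"))            # True
print(is_hit("wilderness in thirteens"))  # False
```

The word boundaries stay on the outer ends only, so "wilderness in
thirteens" is not counted, while the phrase split across a line
break still is.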

P.S.  You're not the Jim Osborn I knew in Heidelberg in the late-
middle nineties, right?

-- 
dman
