At 11:19 1999-12-13 +0000, Martin Ward wrote:
> How about converting the headers to a file of strings
> and using fgrep to look for the strings in the datafile?
> Then you will be grepping for a small number of strings
> in a very large file, instead of the other way around.
Several reasons. The first one to frag the idea is that the key list
may consist of elements which are SUBSTRINGS of what appears in the
header - reversing the direction of the search loses those matches.
This similarly complicates the nature of the utility I want to write to
address the problem.
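The substring pitfall can be shown with a toy example (all file names and
contents below are invented for illustration). fgrep -f treats each line of
the pattern file as a fixed string to find anywhere in an input line, and
that only works in one direction:

```shell
# Toy key list and header line (contents invented).
printf 'aachen.de\n' > keys.txt                          # key is a SUBSTRING
printf 'Received: from rwth-aachen.de\n' > header.txt    # of this header

# Normal direction - keys as patterns, header as input: "aachen.de"
# is a substring of the header line, so fgrep finds the match.
fgrep -f keys.txt header.txt

# Reversed direction - header tokens as patterns, key list as input:
# "rwth-aachen.de" is NOT a substring of the line "aachen.de", so the
# match is silently lost.
tr ' ' '\n' < header.txt > patterns.txt
fgrep -f patterns.txt keys.txt || echo 'no match'
```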
If I'm going to deal with writing anything, it'll be a db-based mechanism
as I mentioned in my first post - that would gain me random access to the
key list, which I could search without having to read the entire
file. Unfortunately, it does mean that I have to contend with file growth
as a result of the index and/or fixed-length database records.
Fortunately, if I were to process the message header and parse it into
components which I'd search on, I could also flag those components that
I've already searched on (no need to repeat a db lookup). Take a procmail
list message header, for example: the domain string "rwth-aachen.de"
appears approx 10 times in any given list message. Reducing the number of
duplicate lookups will improve performance.
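The dedup idea can be sketched in shell; the header below is a made-up
stand-in for a real procmail list message:

```shell
# A toy message header standing in for the real thing.
cat > message.txt <<'EOF'
Received: from mail.rwth-aachen.de (mail.rwth-aachen.de)
Received: from relay.rwth-aachen.de by mail.rwth-aachen.de
From: someone@rwth-aachen.de
EOF

# Split the header into candidate components (runs of letters, digits,
# dots, and dashes), then collapse the duplicates so a domain that
# appears ten times costs one lookup instead of ten.
tr -cs 'A-Za-z0-9.-' '\n' < message.txt | sort -u > components.txt
cat components.txt
```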
Seems to me, caching methodologies may prefer that I look for a large
number of strings in a small file (presuming fgrep steps through the input
file sequentially, which it should): the message header becomes
intensively cached, and each element from the large file is read into
memory once. Whereas if I were reading the headers into a split-up list and
seeking through the large file multiple times (ALL the way through it in
any case where a match isn't produced), it'd be a real pain. OTOH, I
don't know why fgrep should use 15-40MB of RAM for something that seems so
straightforward. I suspect it may be producing one VERY complex expansion
of the headers...
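A minimal sketch of the direction being defended, with invented stand-ins
for the real files: the large domain list becomes fgrep's pattern file, and
the small header is the input it streams past once.

```shell
# Stand-ins for the real files (names and contents invented): a domain
# list used as the pattern file, and a single message header as input.
printf 'spam-example.invalid\nrwth-aachen.de\n' > domains.lst
printf 'Received: from mail.rwth-aachen.de by relay\n' > header.txt

# fgrep reads the entire pattern list up front and builds its matching
# structure in memory (plausibly the source of the large footprint with
# a big real list), then makes one sequential pass over the input.
fgrep -f domains.lst header.txt
```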
> Exactly how you split the headers into strings depends
> on what you are looking for: whole email addresses,
> fully qualified domain names, partial domain names,
> or something else...
The massive file is a list of base domains with TLD, no hostname.
Other, less intensive, spam and twit searches use email addresses or key
phrases (i.e. whitespace and other symbols may exist). If it were
necessary for optimization, I could easily strip keywords out to a separate
file and process them with a separate recipe, but for the time being, those
searches don't consume resources the way the primary spam filter does.
BTW - the domain-based filter runs AFTER all other detection schemes have
been tried, to ensure that any less costly (and more _positive_) way to
ditch the message as spam has already been applied before consuming those
resources. Of course, valid mail runs through the whole filter every time...
---
Please DO NOT carbon me on list replies. I'll get my copy from the list.
Sean B. Straw / Professional Software Engineering
Post Box 2395 / San Rafael, CA 94912-2395