procmail

Re: lockfile on INCLUDERC to throttle?

1999-12-13 07:13:03
At 11:19 1999-12-13 +0000, Martin Ward wrote:

How about converting the headers to a file of strings
and using fgrep to look for the strings in the datafile?
Then you will be grepping for a small number of strings
in a very large file, instead of the other way around.
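A rough sketch of that suggestion (the file names header.txt, patterns.txt, and domains.txt are hypothetical stand-ins for the saved message header, the extracted strings, and the large key list):

```shell
# Break the header into one lowercase token per line, keeping only
# characters that can appear in a domain, and drop empty lines:
tr 'A-Z' 'a-z' < header.txt | tr -cs 'a-z0-9.-' '\n' | grep . > patterns.txt

# Grep for that small set of fixed strings in the very large file
# (fgrep is equivalent to grep -F):
grep -F -f patterns.txt domains.txt
```

Note this matches wherever a header token appears as a substring of a key-list line, which is the reverse of what's needed when a key-list entry is a substring of a header token.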

Several reasons. The first one to frag the idea is that the key list may consist of elements which would be considered SUBSTRINGS of what appears in the header. This similarly complicates the nature of the utility I want to write to address the problem.

If I'm going to deal with writing anything, it'll be a db-based mechanism as I mentioned in my first post - that would gain me random access to the key list, which I could search without having to read the entire file. Unfortunately, it does mean that I have to contend with file growth as a result of the index and/or fixed-length database records.
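Short of a real db, exact whole-key matching can at least be had from sorted flat files. A minimal sketch, assuming tokens.txt holds the candidate strings pulled from the header and domains.sorted is the key list pre-sorted with sort(1) (both hypothetical names); a dbm/btree index would additionally give random access without a full scan, at the file-growth cost noted above:

```shell
# Dedupe and sort the candidate keys from the header:
sort -u tokens.txt > tokens.sorted

# comm -12 prints only lines common to both sorted inputs, i.e.
# exact whole-line matches against the key list - no substring surprises:
comm -12 tokens.sorted domains.sorted
```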

Fortunately, if I were to process the message header and parse it into the components I'd search on, I could also flag those components I've already searched on (no need to repeat a db lookup). Take a procmail list header, for example: the domain string "rwth-aachen.de" appears approximately 10 times in any given list message. Reducing the number of duplicate lookups will improve performance.
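That duplicate-suppression step is cheap to approximate in a pipe. A sketch, again with a hypothetical tokens.txt holding the strings parsed from the header:

```shell
# Show how many times each candidate string occurs in the header
# (the repeated domains show up with counts greater than 1):
sort tokens.txt | uniq -c | sort -rn

# Collapse to one lookup per unique string:
sort -u tokens.txt > tokens.uniq
```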

Seems to me, caching methodologies may prefer that I look for a large number of strings in a small file (presuming fgrep steps through the input file sequentially, which it should): the message header becomes intensively cached, and each element from the large file is read into memory once. Whereas if I were reading the headers into a split-up list and seeking through the large file multiple times (ALL the way through it in any case where a match isn't produced), it'd be a real pain. OTOH, I don't know why fgrep should use 15-40MB of RAM for something that seems so straightforward. I suspect it may be producing one VERY complex expansion of the headers...

Exactly how you split the headers into strings depends
on what you are looking for: whole email addresses,
fully qualified domain names, partial domain names,
or something else...

The massive file is a list of base domains with TLD, no hostname.

Other, less intensive, spam and twit searches use email addresses or key phrases (i.e. whitespace and other symbols may be present). If it were necessary for optimization, I could easily strip the keywords out to a separate file and process them with a separate recipe, but for the time being, those searches don't consume the resources that the primary spam filter does.
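If the phrase-type keys ever did need splitting out, that separate pass could stay a simple case-insensitive fixed-string match (spamphrases.txt is a hypothetical name for the split-out file):

```shell
# One cheap pass over the saved header for literal phrases;
# -F treats patterns as fixed strings, -i ignores case:
grep -F -i -f spamphrases.txt header.txt
```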

BTW - the domain-based filter runs AFTER all other detection schemes have been tried, to ensure that if there is a less costly (and also more _positive_) way to ditch a message as spam, it will have been done before these resources are consumed. Of course, valid mail runs through the whole filter every time...

---
 Please DO NOT carbon me on list replies.  I'll get my copy from the list.

 Sean B. Straw / Professional Software Engineering
 Post Box 2395 / San Rafael, CA  94912-2395
