At 11:19 1999-12-13 +0000, Martin Ward wrote:
> How about converting the headers to a file of strings
> and using fgrep to look for the strings in the datafile?
> Then you will be grepping for a small number of strings
> in a very large file, instead of the other way around.
Several reasons. The first one to frag the idea is that the key list
may consist of elements which are SUBSTRINGS of what appears in the
header - reversing the direction of the search loses those matches.
This similarly complicates the nature of the utility I want to write to
address the problem.
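The substring pitfall can be shown with a toy example (all file names and
contents below are invented for illustration). fgrep -f treats each line of
the pattern file as a fixed string to find anywhere in an input line, and
that only works in one direction:

```shell
# Toy key list and header line (contents invented).
printf 'aachen.de\n' > keys.txt                          # key is a SUBSTRING
printf 'Received: from rwth-aachen.de\n' > header.txt    # of this header

# Normal direction - keys as patterns, header as input: "aachen.de"
# is a substring of the header line, so fgrep finds the match.
fgrep -f keys.txt header.txt

# Reversed direction - header tokens as patterns, key list as input:
# "rwth-aachen.de" is NOT a substring of the line "aachen.de", so the
# match is silently lost.
tr ' ' '\n' < header.txt > patterns.txt
fgrep -f patterns.txt keys.txt || echo 'no match'
```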
If I'm going to deal with writing anything, it'll be a db-based mechanism
as I mentioned in my first post - that would gain me random access to the
key list, which I could search without having to read the entire
file. Unfortunately, it does mean that I have to contend with file growth
as a result of the index and/or fixed-length database records.
Fortunately, if I were to process the message header and parse it into
components which I'd search on, I could also flag those components that
I've already searched on (no need to repeat a db lookup). Take a procmail
list message header, for example: the domain string "rwth-aachen.de"
appears approx 10 times in any given list message. Reducing the number of
duplicate lookups will improve performance.
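The dedup idea can be sketched in shell; the header below is a made-up
stand-in for a real procmail list message:

```shell
# A toy message header standing in for the real thing.
cat > message.txt <<'EOF'
Received: from mail.rwth-aachen.de (mail.rwth-aachen.de)
Received: from relay.rwth-aachen.de by mail.rwth-aachen.de
From: someone@rwth-aachen.de
EOF

# Split the header into candidate components (runs of letters, digits,
# dots, and dashes), then collapse the duplicates so a domain that
# appears ten times costs one lookup instead of ten.
tr -cs 'A-Za-z0-9.-' '\n' < message.txt | sort -u > components.txt
cat components.txt
```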
Seems to me, caching methodologies may prefer that I look for a large
number of strings in a small file (presuming fgrep steps through the input
file sequentially, which it should): the message header becomes
intensively cached, and each element from the large file is read into
memory once. Whereas if I were reading the headers into a split-up list and
seeking through the large file multiple times (ALL the way through it in
any case where a match isn't produced), it'd be a real pain. OTOH, I
don't know why fgrep should use 15-40MB of RAM for something that seems so
straightforward. I suspect it may be producing one VERY complex expansion
of the headers...
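A minimal sketch of the direction being defended, with invented stand-ins
for the real files: the large domain list becomes fgrep's pattern file, and
the small header is the input it streams past once.

```shell
# Stand-ins for the real files (names and contents invented): a domain
# list used as the pattern file, and a single message header as input.
printf 'spam-example.invalid\nrwth-aachen.de\n' > domains.lst
printf 'Received: from mail.rwth-aachen.de by relay\n' > header.txt

# fgrep reads the entire pattern list up front and builds its matching
# structure in memory (plausibly the source of the large footprint with
# a big real list), then makes one sequential pass over the input.
fgrep -f domains.lst header.txt
```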
> Exactly how you split the headers into strings depends
> on what you are looking for: whole email addresses,
> fully qualified domain names, partial domain names,
> or something else...
The massive file is a list of base domains with TLD, no hostname.
Other, less intensive, spam and twit searches use email addresses or key
phrases (i.e. whitespace and other symbols may exist). If it were
necessary for optimization, I could easily strip keywords out to a separate
file and process them with a separate recipe, but for the time being, those
searches don't consume resources the way the primary spam filter does.
BTW - the domain-based filter runs AFTER all other detection schemes have
been tried, to ensure that any less costly (and more _positive_) way to
ditch the message as spam has already been applied before consuming those
resources. Of course, valid mail runs through the whole filter every time...
---
Please DO NOT carbon me on list replies. I'll get my copy from the list.
Sean B. Straw / Professional Software Engineering
Post Box 2395 / San Rafael, CA 94912-2395