Re: how to match against first N lines/bytes of body?


Thanks for all the ideas!

--"J. Daniel Smith" <19971113204103(_dot_)202711(_dot_)FMU21649(_at_)handel>--


have the right statistical properties. I'm convinced that simply 
filtering just on the first 50 or so lines would be much better.


If that's the case, you could simply limit your spam filtering to
those messages that are less than N lines


Yes, but that's not quite what I'm after. I am conjecturing
that all my "B" recipes (whether related to spam or not) can
make correct decisions (in the case of spam, actually make *better* decisions)
by looking only at the first 50-100 lines of the message, no matter 
how long the message actually is. So I'd rather not simply ignore
longer messages.

--"David W. Tamkin" <m0xW6Ki-000k13C(_at_)miso(_dot_)wwa(_dot_)com>--


If N had been small, say 6 as in a previous thread about this, I'd
suggest this, which allows for up to five (6 - 1) previous lines;
leave out the last ".*" just before "pattern" if the pattern is
left-anchored:

  :0B
  * ^^(.*$(.*$(.*$(.*$(.*$)?)?)?)?)?.*pattern
  whatever

But if N=50, to heck with it:

  :0Bi
  toplines=| head -$N # sed ${N}q if you don't have head

  :0a
  * toplines ?? pattern
  whatever
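
For anyone trying the pipe outside procmail first, here is a minimal shell sketch of the two truncation commands named in the comment above. The function names are made up for illustration, and `head -n N` is used as the current spelling of the older `head -N` form.

```shell
# Hypothetical helper names; $1 is the number of leading lines to keep.
first_lines_head() { head -n "$1"; }

# Same result with sed: print lines until line N, then quit.
first_lines_sed() { sed "${1}q"; }

# Both commands print the first two lines of their input.
printf 'one\ntwo\nthree\nfour\n' | first_lines_head 2
printf 'one\ntwo\nthree\nfour\n' | first_lines_sed 2
```

In procmail itself, the lowercase `b` flag is what feeds only the body to a pipe; the uppercase `B` flag only changes what the conditions are matched against.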


Thanks. Extracting a variable like "toplines" will work just 
as in your example in a number of cases I have in mind.
BUT, I really wanted to match the truncated body against various scoring 
recipes rather than true/false tests. Is there a way of doing this?
I realize that I could simulate scoring recipes (more or less)
with regular ones, but not always very cleanly or concisely.

--era eriksson <199711141045(_dot_)MAA13924(_at_)kontti(_dot_)Helsinki(_dot_)FI>--

On Thu, 13 Nov 1997 14:57:48 -0600 (CST),
"David W. Tamkin" <dattier(_at_)miso(_dot_)wwa(_dot_)com> wrote:
 > But if N=50, to heck with it:
 >   :0Bi

Can't you just do a MATCH grab instead of running an external process? 
The regex to grab fifty lines (or all of them, if there are fewer than
fifty) is not going to be very pretty, but it should be a lot more
efficient. 

    :0B
    * ^^([    ]*|$)*\/[^      ].*$(.*$)?(.*$)? ... etc, another 47 of'em
    { toplines="$MATCH" }
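
Rather than typing the repeated groups by hand, the tail of that regex could be generated. This is only a sketch: `gen_optional_lines` is a hypothetical helper, and the count of 49 assumes one mandatory first line followed by up to 49 optional ones, for 50 lines in total.

```shell
# Hypothetical helper: print "(.*$)?" repeated $1 times on one line.
gen_optional_lines() {
    count=$1
    out=''
    i=0
    while [ "$i" -lt "$count" ]; do
        out="${out}(.*\$)?"
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

# Tail to paste after the first-line part of the condition.
gen_optional_lines 49
```

The output is meant to be pasted into the rc file once; nothing here runs at delivery time.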

The skipping of whitespace at the beginning of the message is of
course not necessary or anything. You should probably set LINEBUF
reasonably high (at the very least 80*50 = 4000 bytes; probably
setting it to 8192 or 16384 is a good idea while you're at it) in
order to avoid trouble.

/* era */


Won't this be a danger on non-text or badly formatted messages?
(I.e., might overrun linebuf, since there's no a priori bound
on how long the first "line" in a message might be.)
With "head" I can extract a specified number of bytes.
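
A rough sketch of what I mean, capping bytes first and then lines so that one enormous "line" cannot get through. The name `top_of_body` is made up, and `head -c` is an assumption about the local `head`: GNU and BSD versions have it, but POSIX does not require it.

```shell
# Hypothetical helper: $1 = maximum bytes, $2 = maximum lines.
# The byte cap is applied first, so the result never exceeds $1 bytes.
top_of_body() {
    max_bytes=$1
    max_lines=$2
    head -c "$max_bytes" | head -n "$max_lines"
}

# A 12-byte input with no newline is cut to 4 bytes.
printf '%s' 'aaaaaaaaaaaa' | top_of_body 4 50
```

Keeping the byte limit below LINEBUF should also bound the size of whatever ends up in the variable.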

Thanks,
  Adam Grove