procmail

Re: Matching repeating lines?

1997-02-10 14:17:27
Lars Kellogg-Stedman asked,

| Syslog on some machines is broken, and instead of adding a note to the 
| effect of "previous message repeated 271 times", it will include all 271 
| separate log messages.
| 
| I don't need to see all of these.

| What's the most efficient way of filtering lines that are repeated more 
| than <n> times?  If I was trying to match specific text, this would be 
| trivial, but as I'm trying to match *any* repeated text I'm not sure how 
| to proceed.

Well, let's think ... procmail does not support back-references, but you
can extract into $MATCH and then see if $MATCH recurs.

Is there an easy way to identify syslog mailings?  Can you then sort -u
on a field (such as the message text?) that will be alike on all the
lines that are duplicates, ignoring the timestamp?  Something like

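  # Clear SHELLMETAS so the '<' doesn't make procmail hand the command to a
  # shell; sort is run directly.  Restore the old value afterwards.
  savemetas=$SHELLMETAS
  SHELLMETAS
  :0bf
  | sort -u -t'<' -k2 # sort -u -t'<' +1 if your sort doesn't do -k
  SHELLMETAS=$savemetas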

Now, that won't get you a count (grep -c would, or procmail scoring can).
I'm not sure whether you want one or not.
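
If you'd like to try that filter outside procmail first, here's a minimal
shell sketch.  The log lines are invented (I'm assuming a timestamp, a
hostname, and the message text inside angle brackets, which may not be what
your syslog produces), and sample_log, dedup_by_message, and count_message
are just hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical sample data: timestamp, hostname, then the message in <...>.
# The real syslog format may differ; adjust the -t delimiter to suit.
sample_log() {
    printf '%s\n' \
        'Feb 10 14:00:01 hostA <disk full>' \
        'Feb 10 14:00:02 hostA <disk full>' \
        'Feb 10 14:00:03 hostA <login failed>' \
        'Feb 10 14:00:04 hostA <disk full>'
}

# Same sort invocation as the recipe: split fields on '<', compare from
# field 2 onward (the message text), keep one line per distinct message.
dedup_by_message() {
    sort -u -t'<' -k2
}

# Count how often one known message occurs (-F: fixed string, -c: count).
count_message() {
    grep -c -F "<$1>"
}

sample_log | dedup_by_message          # two lines survive, one per message
sample_log | count_message 'disk full' # prints 3
```

Note that grep -c only helps once you know which text to count, which is
the original difficulty; and which of the duplicate lines sort -u keeps may
depend on your sort.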

I think we need more detail as to exactly what these lines look like, what
else can occur in the same message, and what part has to be identical
(everything except the timestamp, including the hostname?) for you to want
them grouped together.  Also, what if this happens:

  198 lines that should be grouped together
  1 different line
  73 more lines like the first 198

Would you want the 73 combined with the 198 to make 271 or kept
separate?  If you want them kept separate, a sed script with proper use of N
and D and maybe P would work; if you want them combined, sort -u is probably
the answer.  Either way, getting a count will be a headache.

Here's a sample sed recipe (I hate putting sed commands into procmailrcs,
though):

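  # $!N  append the next input line (except on the last line)
  # P    print the first line unless the next one repeats its " host <message>"
  # D    drop the first line and restart with whatever is left
  savemetas=$SHELLMETAS
  SHELLMETAS
  :0bf
  | sed -e '$!N' -e '/\( [^ ]* <[^>]*>\)\n.*\1/!P' -e D
  SHELLMETAS=$savemetas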

That should eliminate all but the last line of a run of lines where the
hostname and the message are identical.
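
To check the sed command above on its own, here's a small shell sketch run
against an interrupted run like the 198/1/73 case, scaled down.  Again the
sample lines are made up (timestamp, hostname, message in angle brackets is
my assumption about the format), and sample_log and collapse_runs are
hypothetical names.

```shell
#!/bin/sh
# Hypothetical sample data: a run of three, one different line, then two
# more lines like the first run.
sample_log() {
    printf '%s\n' \
        'Feb 10 14:00:01 hostA <disk full>' \
        'Feb 10 14:00:02 hostA <disk full>' \
        'Feb 10 14:00:03 hostA <disk full>' \
        'Feb 10 14:00:04 hostA <login failed>' \
        'Feb 10 14:00:05 hostA <disk full>' \
        'Feb 10 14:00:06 hostA <disk full>'
}

# The sed script from the recipe, unchanged: keep only the last line of
# each run of adjacent lines with the same " host <message>" tail.
collapse_runs() {
    sed -e '$!N' -e '/\( [^ ]* <[^>]*>\)\n.*\1/!P' -e D
}

sample_log | collapse_runs
# Prints:
#   Feb 10 14:00:03 hostA <disk full>
#   Feb 10 14:00:04 hostA <login failed>
#   Feb 10 14:00:06 hostA <disk full>
```

The two "disk full" runs stay separate because a different line sits between
them; feeding the same sample to sort -u -t'<' -k2 would merge them and
leave two lines.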