Lars Kellogg-Stedman asked,
| Syslog on some machines is broken, and instead of adding a note to the
| effect of "previous message repeated 271 times", it will include all 271
| separate log messages.
|
| I don't need to see all of these.
| What's the most efficient way of filtering lines that are repeated more
| than <n> times? If I was trying to match specific text, this would be
| trivial, but as I'm trying to match *any* repeated text I'm not sure how
| to proceed.
Well, let's think ... procmail does not support back-references, but you
can extract into $MATCH and then see if $MATCH recurs.
Is there an easy way to identify syslog mailings? If so, can you sort -u
on a field (the message text, say) that will be alike on all the
duplicate lines, ignoring the timestamp? Something like this:

    # Unset SHELLMETAS so procmail runs the filter directly, not via $SHELL.
    savemetas=$SHELLMETAS
    SHELLMETAS

    # Use  sort -u -t'<' +1  if your sort doesn't do -k.
    :0bf
    | sort -u -t'<' -k2

    SHELLMETAS=$savemetas

Now, that won't get you a count (grep -c would, or procmail scoring can).
I'm not sure whether you want one or not.
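
If you do want a count, one rough way to get it outside procmail is to
run the message text through sort and uniq -c. This is only a sketch:
count_by_message is a made-up name, and it assumes the message is
everything after the first '<' on each line, which is a guess at your
format.

```shell
# Hypothetical helper: tally identical messages, most frequent first.
# Assumes the message text is everything after the first '<' on a line.
count_by_message() {
    cut -d'<' -f2- |      # drop timestamp and hostname, keep the message
        sort |            # bring identical messages together
        uniq -c |         # prefix each distinct message with its count
        sort -rn |        # highest count first
        sed 's/^ *//'     # strip the padding uniq -c puts before the count
}
```

Fed four lines of which three end in "<disk full>" and one in
"<link down>", it should print "3 disk full>" and then "1 link down>".
Note that this throws the timestamps away, so it is a report rather
than a filter.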
I think we need more detail as to exactly what these lines look like, what
else can occur in the same message, and what part has to be identical
(everything except the timestamp, including the hostname?) for you to want
them grouped together. Also, what if this happens:

    198 lines that should be grouped together
    1 different line
    73 more lines like the first 198

Would you want the 73 combined with the 198 to make 271, or kept
separate? If you want them kept separate, a sed script with proper use of N
and D and maybe P would work; if you want them combined, sort -u is probably
the answer. Either way, getting a count will be a headache.
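
For the keep-them-separate case, the count is less painful if you
reach for awk instead of sed. The following is a sketch under the
same guessed format (message is everything from the first '<' on),
with count_runs as a made-up name. It keeps the last line of each
consecutive run and prefixes it with the length of the run.

```shell
# Hypothetical helper: collapse each run of consecutive lines that share
# the same text from the first '<' onward, printing "<count> <last line>".
count_runs() {
    awk '
        {
            i = index($0, "<")
            key = (i ? substr($0, i) : $0)   # whole line if there is no "<"
            if (n > 0 && key != prev) {      # the run just ended: report it
                printf "%d %s\n", n, lastline
                n = 0
            }
            prev = key; lastline = $0; n++
        }
        END { if (n > 0) printf "%d %s\n", n, lastline }'
}
```

With the 198 / 1 / 73 input above this would report three runs (198,
1 and 73) rather than one total of 271, because only adjacent lines
are compared.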
Here's a sample sed recipe (I hate putting sed commands into procmailrcs,
though):
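
    savemetas=$SHELLMETAS
    SHELLMETAS

    # $!N joins the next line; P prints the first line unless the second
    # repeats its " hostname <message>" tail; D drops the first line and loops.
    :0bf
    | sed -e '$!N' -e '/\( [^ ]* <[^>]*>\)\n.*\1/!P' -e D

    SHELLMETAS=$savemetas
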
That should eliminate all but the last line of a run of lines where the
hostname and the message are identical.
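
To try the sed command outside procmail, you can wrap it in a shell
function and feed it a few lines. The sample lines are invented, and
they assume each line ends in " hostname <message>", which is what
the regular expression expects. squeeze_runs is a made-up name.

```shell
# Hypothetical wrapper around the sed command from the recipe.
# Keeps only the last line of each run of consecutive lines whose
# trailing " hostname <message>" part is identical.
squeeze_runs() {
    sed -e '$!N' -e '/\( [^ ]* <[^>]*>\)\n.*\1/!P' -e D
}

# Invented sample input: two "disk full" lines in a row, one different
# line, then "disk full" again.
printf '%s\n' \
    'Jan  5 10:00:01 host1 <disk full>' \
    'Jan  5 10:00:02 host1 <disk full>' \
    'Jan  5 10:00:03 host1 <link down>' \
    'Jan  5 10:00:04 host1 <disk full>' | squeeze_runs
```

This should print the 10:00:02, 10:00:03 and 10:00:04 lines. The
first "disk full" run is squeezed to its last line, and the later
"disk full" line survives because it is not adjacent to that run.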