Re: Removing line-wrapped header: gawk

On  9 Sep, Volker Kuhlmann wrote:

[...]
> Any perl solution suffers from huge overhead, but is more portable. As I
> don't speak it I tend to avoid it...


Fair enough.  Not to argue, because you're right, and I agree it's
virtuous to minimize overhead [1], but ...

Because the servers are mine, I feel free to experiment with all sorts
of things I'd avoid if using someone else's.  I have some amazingly
contorted stuff in my rcfiles, including adding benchmark info to each
message, and to a logfile, that has machine load, number of rcfiles
visited, last rcfile visited, and time (to 1/10000 sec) for both
"system" and personal rcfiles.  As you might guess, anyone doing that
is doing some pretty weird stuff in between.  That would be a good
guess. ;-)  One of these days I'm going to get some of the stuff
published but, for now, I'll just say I've been amazed at how low the
"benchmarks" are relative to what's being done. [2]  And there's a
little bit of perl in there. ;-)

Anyway this isn't about advocating perl.  It's the only thing I speak,
for no other reasons than that I'm just a dumb bond trader who does this
for fun, my capacity for this stuff is limited, and it's the first thing
I was introduced to.

I know Volker doesn't care about this, as he already has a working
solution.  This is to fix the 2 line/header limitation, to point out
another caveat, and to leave a more comprehensive solution for
posterity than the nonsense posted way too late last night.  It also
adds explanations.

First, the caveat.  I got to thinking the match pattern might wrap from
one line to another, in which case any whitespace might get turned
into a <newline><tab> combination.  (I think Volker said earlier a
continuation line can legitimately begin with a <tab> or <space>. He's
right and the code below should work for either.)  So I worked with a
message that had the following headers interspersed, and it got rid of
the last two.

Kuhlmann: line 1a
        line 2a
Kuhlmann: line 1b
        line 2b, but dump this crap
Kuhlmann: line 1c
        line 2c, but dump
        this crap, too line 3c

Here's the code as it might go into a file for execution from procmail:

---(cut here)---
#!/usr/local/bin/perl
# $pattern = ", but dump this crap";
$pattern = shift || die "Kuhlmann: filter requires pattern arg\n";
$pattern =~ s/ /\\s+/g;
@bork = ();
@head = <>;
for my $i (0..$#head)
{
   next if grep { $i eq $_ } @bork;
   $_ = $head[$i];
   if( /^Kuhlmann:/ )
   {
      my $go = $i;
      while(1)
      {
         my $lookahead = $head[++$go];
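         # stop at anything that is not a continuation line (or at end of input)
         last unless defined($lookahead) && $lookahead =~ /^[ \t]/;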
         push @bork, $go;
         $_ .= $lookahead;
      }
      next if /$pattern/;
      print;
      next;
   }
   print;
}
---(cut here)---
 
Usage example, assuming it's saved to file "script" in your $PATH:

:0 fhw
| script ', but dump this crap'

Breaking it down:

Uncomment the first line and comment out the second if the pattern never
changes.  The second lets you pass a pattern as an argument if it's not
constant.

     # $pattern = ", but dump this crap";
     $pattern = shift || die "Kuhlmann: filter requires pattern arg\n";

In the pattern, replace globally (/g) each <space> literal with perl's
"\s" to match any whitespace (including \n, \t, \r), modified with "+"
for one or more.  This allows the pattern itself to span multiple lines.

     $pattern =~ s/ /\\s+/g;
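
To see what that substitution buys, here is a small standalone sketch.
The sub name relax_spaces and the sample header are made up for
illustration; only the s/ /\\s+/g step comes from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn each literal space in a plain-text pattern into \s+ so the
# pattern can match across a folded (wrapped) header line.
sub relax_spaces {
    my ($pattern) = @_;
    $pattern =~ s/ /\\s+/g;
    return $pattern;
}

my $pattern = relax_spaces(", but dump this crap");
print "$pattern\n";    # ,\s+but\s+dump\s+this\s+crap

# A header folded in the middle of the phrase (newline + tab).
my $folded = "Kuhlmann: line 1c\n\tline 2c, but dump\n\tthis crap, too line 3c\n";
print "relaxed pattern matches the folded header\n"
    if $folded =~ /$pattern/;

# The untouched pattern wants a single space between "dump" and "this".
print "literal pattern does not match\n"
    unless $folded =~ /, but dump this crap/;
```

Everything else in the pattern is still interpreted as a regex, so
characters like "." or "(" in the argument keep their special meaning.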

Set up some list (array) variables.  @head will slurp standard input
with each list element containing one \n terminated line.  In other
words, multi-line headers are still multi-line.

We will be looking ahead for relevant continuation lines, and @bork will
track matches which should be ignored when encountered again.  This
might be unnecessary, but my recollection is manipulating lists (e.g.
deleting elements) while iterating through them may cause unpredictable
results.  I didn't spend the time to test simpler solutions.

     @bork = ();
     @head = <>;

Iterate through @head by using index variable $i, which is sequentially
assigned a number 0 through $#head (last element's index) each time
through the loop.

     for my $i (0..$#head)

Do nothing if index $i is equal to any element of @bork, which means
we've already seen this element and don't care about it.

        next if grep { $i eq $_ } @bork;
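
The grep walks all of @bork every time through the loop.  For a mail
header that hardly matters, but a hash would give the same answer with
one lookup.  A minimal sketch; %skip and keep_unskipped are made-up
stand-ins, not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the lines whose index is not marked in the skip hash.
sub keep_unskipped {
    my ($head, $skip) = @_;
    my @kept;
    for my $i (0 .. $#{$head}) {
        next if $skip->{$i};       # one hash lookup instead of a grep
        push @kept, $head->[$i];
    }
    return @kept;
}

my @head = ("A: 1\n", "\tcontinued\n", "B: 2\n");
my %skip = (1 => 1);    # pretend a lookahead already consumed line 1

print keep_unskipped(\@head, \%skip);    # prints the A: and B: lines
```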

Assign current @head element to perl's "magic" $_ variable for pattern
matching and printing below.

        $_ = $head[$i];

If the current header line ($_) matches ^Kuhlmann:, enter the block.

        if( /^Kuhlmann:/ )

Assign current value of $i to $go, to be used to look ahead at @head.

           my $go = $i;

Infinite loop. Yikes!

           while(1)

While looping, assign $lookahead by incrementing $go (++$go) and, at
the same time, use this incremented value to index @head.

              my $lookahead = $head[++$go];

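Exit the loop unless $lookahead is a continuation line, i.e. it exists
and begins with a <space> or <tab>.  Anything else is the start of a
new header, the blank line that ends the header area, or the end of
@head.  Testing only for the start of a new header (/^\S+:/) is not
enough: when a Kuhlmann: header is the last one in the input nothing
ever matches, and the loop really is infinite.

              last unless defined($lookahead) && $lookahead =~ /^[ \t]/;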

Otherwise it is a continuation line of a Kuhlmann: header.  We won't
want to look at it in the outer loop again, so add the value of $go
to the @bork list.

              push @bork, $go;

Concatenate $lookahead to $_.

              $_ .= $lookahead;

Now we have a complete Kuhlmann: header.  If it matches $pattern,
then ignore it by jumping to top of outer loop.

           next if /$pattern/;

Otherwise print it.

           print;

We're done with this Kuhlmann: header, matched or not, so jump to top
of outer loop.

           next;

It was not any part of a Kuhlmann: header, so print it.

        print;

See, that wasn't so bad, was it?

If a person wanted to use this without putting it in an external file,
it could be done like this:

:0 fhw
|perl -e '$pattern=", but dump this crap";$pattern =~ s/ /\\s+/g;'\
      -e '@bork=();@head=<>;for my $i (0..$#head){next if grep{$i eq $_}'\
      -e '@bork;$_=$head[$i];if(/^Kuhlmann:/){my $go=$i;while(1){'\
      -e 'my $lookahead=$head[++$go];'\
      -e 'last unless defined($lookahead) && $lookahead =~ /^[ \t]/;'\
      -e 'push @bork, $go;$_ .= $lookahead;}next if /$pattern/s;'\
      -e 'print;next;}print;}'

N.B. This version assumes the $pattern is constant.  If it needs to
be passed, the code above needs modification.

FWIW, I tried this by putting the entire header area into a single
string variable, and trying to delete matching headers with a regex,
but wasn't able to make it work.  It can probably be done, but not by
me right now.
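
For the record, one way the single-string idea might be made to work is
to let a regex grab each complete Kuhlmann: header (first line plus any
continuation lines) and decide inside the replacement whether to keep
it.  This is only a sketch, not tested under procmail; drop_matching is
a made-up name, and it assumes every header line ends in a newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove each complete Kuhlmann: header (first line plus continuation
# lines) whose text matches $pattern; everything else is left alone.
sub drop_matching {
    my ($header, $pattern) = @_;
    (my $re = $pattern) =~ s/ /\\s+/g;    # let spaces match folds
    # /m makes ^ match at every line start; /e runs the replacement as code.
    $header =~ s{^(Kuhlmann:.*\n(?:[ \t].*\n)*)}
                { my $h = $1; $h =~ /$re/ ? '' : $h }gme;
    return $header;
}

my $header = join '',
    "Kuhlmann: line 1a\n", "\tline 2a\n",
    "Kuhlmann: line 1c\n", "\tline 2c, but dump\n", "\tthis crap, too line 3c\n",
    "Subject: test\n", "\n";

# Prints the 1a header, the Subject: line and the blank separator.
print drop_matching($header, ", but dump this crap");
```

Copying $1 into $h before the inner match is deliberate: a successful
inner match resets the capture variables.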

I know this was probably only of academic interest (if any at all),
but I wasn't happy with what I put out last night and wanted to
modify the record.

Back to work now ...

Don


[1] http://www.xray.mpe.mpg.de/mailing-lists/procmail/2001-09/msg00007.html

[2] For example, a daily message has a MIME attachment and uses metamail
    to extract it, unzip to unzip it, perl to split 8 statements
    concatenated into one text file back into their component parts,
    more perl to individually process each statement and distill them
    to something more useful and mail the new statements, and saves
    the originals to date named files in their home while making sure
    not to clobber others (in case revised versions arrive).  There's
    probably more, and it's all on top of the weird stuff I do for
    every incoming message, and it hits 154 "system" rcfiles in 1.9206
    secs and 34 personal rcfiles in 1.9467 secs (at a .04 load).  Note
    those "benchmarks" use perl on the front and back ends (of course!),
    adding still more overhead.  That's on hardware at least 5 years old,
    and that doesn't seem too bad to me.


-- 
Email address in From: header is valid  * but only for a couple of days *
This is my reluctant response to spammers' unrelenting address harvesting

