procmail

Removing extra spaces while preserving indents

2002-06-23 23:40:02
The .procmailrc at the end of this message takes an HTML email whose
Subject begins with html2text, converts it to text using lynx, removes
the double justification lynx adds while preserving indents, and mails
the result off.  The sed formatting now does what I want, but it seems
like there must be a more elegant solution.

Here's the sample output that lynx -dump produces (the input is at
<http://www.dankohn.com/blog/2002_06_23_archive.html#85193263>):


   The NYT has a good [1]article about new television shows.

     At  each  network, 100 scripts become 20 pilots become half a dozen
     new  TV programs. The story of three first-time show creators, in a
     profligate system, up against impossible odds

   It  seems  that  Hunter  S.  Thompson's words still hold true: "The TV
   business  is  a cruel and shallow money trench, a long plastic hallway
   where thieves and pimps run free, and good men die like dogs."

References

   1. http://www.nytimes.com/2002/06/23/magazine/23TVPILOT.html


And here's the output I want (single justified instead of double
justified):


   The NYT has a good [1]article about new television shows.

     At each network, 100 scripts become 20 pilots become half a dozen
     new TV programs. The story of three first-time show creators, in a
     profligate system, up against impossible odds

   It seems that Hunter S. Thompson's words still hold true: "The TV
   business is a cruel and shallow money trench, a long plastic hallway
   where thieves and pimps run free, and good men die like dogs."

References

   1. http://www.nytimes.com/2002/06/23/magazine/23TVPILOT.html


The issue is that lynx indents normal text by 3 spaces and blockquotes
by 5, but then double justifies the text, padding it with extraneous
spaces.  My sed commands swap the 5-space indent for a placeholder,
collapse every run of spaces to one, and then restore both indents.  The
following works, but it seems like there must be a more elegant
solution:

| lynx -dump -force_html -stdin \
| sed -e 's/^\ \ \ \ \ /bigindent/' \
| sed -e 's/\ \+/\ /g' \
| sed -e 's/^\ /\ \ \ /g' \
| sed -e 's/^bigindent/\ \ \ \ \ /' 

What I want is something like:

    sed -e '/\^[   |     ]/s/\ \+/\ /g'

But where the first /[stuff]/ "eats" the spaces at the beginning so that
the s//g is applied only to the remaining spaces in the line.  Is this
possible, or should I just stick with the inelegant solution?
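
One possibility, as a minimal sketch rather than a definitive answer:
instead of eating the indent first, require a non-space character in
front of each run of spaces, so the spaces at the start of a line never
match.  The function name collapse_spaces is purely illustrative, and
this assumes the indents are made of spaces, not tabs.

```shell
# Hypothetical helper: collapse runs of spaces inside a line while
# leaving the leading indent alone.  A run only matches when a non-space
# character precedes it, so spaces at the start of the line are skipped.
# Uses "  *" (one or more spaces) instead of "\+" to stay within POSIX BRE.
collapse_spaces() {
    sed -e 's/\([^ ]\)  */\1 /g'
}

# Example: one 3-space and one 5-space indent, both with justified text
printf '   It  seems  that\n     At  each  network,\n' | collapse_spaces
```

On the two sample lines this should leave the 3- and 5-space indents
intact and reduce the gaps between words to single spaces, which would
replace all four sed stages with one.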


Attached is the actual .procmailrc, for context.  Thanks in advance for
any help you can offer.


# Debug logging: use tail -f procmail.log to follow it; comment out when done
VERBOSE=yes
LOGFILE=$HOME/procmail.log
LOGABSTRACT=all

ME=dan(_at_)dankohn(_dot_)com
SUBJ_=`formail -xSubject:`
DIR=$HOME/.procmail



# html2text

:0w
* ^Subject: html2text.*
{


# Message is text/html with no multipart mixed, related, or
# alternative.  Process body with lynx -dump.
# Sed lines collapse spaces and then restore the indent
# Splitting across lines seems to break the sed commands

:0wb
* ^Content-Type: text/html.*
| lynx -dump -force_html -stdin \
| sed -e 's/^\ \ \ \ \ /bigindent/' \
| sed -e 's/\ \+/\ /g' \
| sed -e 's/^\ /\ \ \ /g' \
| sed -e 's/^bigindent/\ \ \ \ \ /' \
| mutt $ME -s "converted: ${SUBJ_}"


# Message is multipart/alternative.  First part is text
# and should be discarded.  Second part is HTML and should
# be converted

:0w
* ^Content-Type: multipart/alternative.*
| munpack -t -C $DIR \
&& lynx -dump -force_html $DIR/part2 \
| sed -e 's/^\ \ \ \ \ /bigindent/' \
| sed -e 's/\ \+/\ /g' \
| sed -e 's/^\ /\ \ \ /g' \
| sed -e 's/^bigindent/\ \ \ \ \ /' \
| mutt $ME -s "converted: ${SUBJ_}" \
&& rm -f $DIR/*


# Message is multipart/mixed or related.  HTML is first part.

:0w
* ^Content-Type: multipart.*
| munpack -t -C $DIR \
&& lynx -dump -force_html $DIR/part1 \
| sed -e 's/^\ \ \ \ \ /bigindent/' \
| sed -e 's/\ \+/\ /g' \
| sed -e 's/^\ /\ \ \ /g' \
| sed -e 's/^bigindent/\ \ \ \ \ /' \
| mutt $ME -s "converted: ${SUBJ_}" \
&& rm -f $DIR/*
}

          - dan
--
Dan Kohn <mailto:dan(_at_)dankohn(_dot_)com>
<http://www.dankohn.com/>  <tel:+1-650-327-2600>
