|Mon 1998-10-05 Jacques Gauthier <jacques_g(_at_)yahoo(_dot_)com> list.procmail
| > lynx -dump is the best popular solution and a little more rough
| > solution is the perl one-liner:
|
| Would that be:
|
| * H ?? (<HTML>)
| | lynx -dump | rest_of_treatement ?
I must have an old lynx here, because it doesn't accept stdin, so
I have to save the content to a file before calling lynx. I'd choose
the perl one-liner for its simpler recipe handling (provided that
accuracy is not a concern).
Here is the section that I just added to pm-tips.txt. Suggestions welcome.
jari
14.7 Converting HTML body to plain text
The most popular way to convert an HTML body into plain text is to
use `lynx'. A more straightforward method is a perl one-liner: it is
quicker and easier to use with procmail, but it doesn't pretend to
know about the HTML DTD. The recipe below should be taken with a grain
of salt: seeing an HTML tag is no guarantee that the body contains
"only" HTML. A cautious recipe writer also watches for MIME multipart
messages. (See `pm-jamime.rc' for a way to extract some MIME
characteristics from a message.)
The recipe below is written so that you can add more alternative
HTML conversion scripts. You may even want to select the conversion
per message, e.g. perl for unimportant ones.
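    :0 B
    * ()<HTML>
    * ()</HTML>
    {
        conversion = "lynx"     # or select this conditionally

        :0
        * conversion ?? lynx
        {
            file     = "$HOME/tmp/msg.html"
            LOCKFILE = $file$LOCKEXT

            # Older lynx cannot read stdin: save the body to a file first
            :0 bwc
            | cat - > $file; lynx -dump $file > $file.plain

            # Replace the body with the converted text
            :0 fbwi
            | cat $file.plain

            # Release the lock
            LOCKFILE
        }

        # Any other conversion value: strip the tags with perl
        :0 E fbw
        | perl -0777 -pe 's/<[^>]*>//g'
    }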
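
To try the conditions on saved messages before trusting the recipe,
something like the following shell sketch may help. It is not part of
procmail: the function name would_convert and the sample messages are
made up, and the multipart test is an extra precaution that the recipe
above does not contain.

```shell
# Hypothetical helper: succeed when the message on stdin has both
# <HTML> and </HTML> in its body and is not declared MIME multipart.
would_convert() {
    msg=$(cat)
    # Header = everything up to the first empty line; body = the rest.
    header=$(printf '%s\n' "$msg" | sed '/^$/q')
    body=$(printf '%s\n' "$msg" | sed '1,/^$/d')

    # Simplification: a Content-Type folded over several header lines
    # is not detected here.
    if printf '%s\n' "$header" | grep -qi '^Content-Type:.*multipart/'; then
        return 1
    fi

    # Case-insensitive, like procmail's own matching.
    printf '%s\n' "$body" | grep -qi '<HTML>' &&
        printf '%s\n' "$body" | grep -qi '</HTML>'
}

# Made-up sample messages.
plain_html='From: a@example.com
Subject: test

<html><body>hello</body></html>'

multipart='From: a@example.com
Content-Type: multipart/alternative; boundary="x"

--x
<html><body>hello</body></html>
--x--'

printf '%s\n' "$plain_html" | would_convert && echo "convert"
printf '%s\n' "$multipart" | would_convert || echo "leave alone"
```

For a message saved in a file, run it as: would_convert < saved-message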