procmail

Re: Recipe for converting HTML mails to...

1998-10-06 00:06:28
|Mon 1998-10-05 Jacques Gauthier <jacques_g(_at_)yahoo(_dot_)com> list.procmail
| > lynx -dump is the best popular solution and a little more rough solution is
| > the perl onliner:
| 
| Would that be:
| 
| * H ?? (<HTML>)
| | lynx -dump | rest_of_treatement ?

I must have an old lynx here, because it doesn't accept stdin, so
I have to save the content to a file before calling lynx. I'd choose
the perl one-liner because the recipe is simpler to handle (provided that
accuracy is not a concern).

Here is the section that I just added to pm-tips.txt. Suggestions welcome.
jari

    14.7 Converting HTML body to plain text

        The most popular way to convert an HTML body into plain text is to
        use `lynx'. A more straightforward method is a perl one-liner: it is
        quicker and easier to use with procmail, but it doesn't pretend to
        know the HTML DTD. Take the recipe below with a grain of salt:
        seeing an HTML tag is no guarantee that the body contains "only"
        HTML. A cautious recipe writer also watches for MIME multipart
        messages. (See `pm-jamime.rc' for a way to extract some MIME
        characteristics from a message.)

        This recipe has been written so that you can add more alternative
        HTML conversion scripts. You may even want to select the appropriate
        conversion per message: e.g. perl for unimportant ones.

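            # Match only if the body has both an opening and a closing tag.
            # The leading () keeps procmail from reading `<' as a size test.
            :0 B
            * ()<HTML>
            * ()</HTML>
            {
                conversion = "lynx"     # or select this conditionally

                :0
                * conversion ?? lynx
                {
                    file = "$HOME/tmp/msg.html"

                    # Global lock: one procmail at a time may use $file
                    LOCKFILE = $file$LOCKEXT

                    # Save a copy of the body, then convert it with lynx
                    :0 bwc
                    | cat - > $file;  lynx -dump $file > $file.plain

                    # Replace the body with the converted text
                    :0 fbwi
                    | cat $file.plain

                    # Unset the variable to release the lock
                    LOCKFILE
                }

                # Otherwise (conversion is not lynx) strip tags with perl
                :0 E fbw
                | perl -0777 -pe 's/<[^>]*>//g'

            }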