This is (much) harder, in the general case, than it looks, because of the
famous looseness of what is considered "HTML". (This laxity was once
touted by HTML developers as a desirable feature, and probably did promote
HTML's adoption in some respects.) Since HTML in the wild is more or less
tag soup, saving it as plain text more or less means implementing an HTML
parser, which is a major part of a browser. (XML parsing is comparatively
trivial.)
If you can constrain the "HTML" coming in to a controlled dialect of XML
(using HTML tags if you like for browser friendliness), you can achieve
this straightforwardly using stylesheets.
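
To give a sense of why well-formed input makes the job tractable, here is a
minimal sketch in Python rather than XSLT (a substitution made purely for
illustration; a stylesheet would do the same walk with templates and text
output). It assumes the input is well-formed XML using HTML element names.
The function name xhtml_to_text and the BLOCK_TAGS set are hypothetical
choices for this example, and it makes no attempt at laying out tables.

```python
import xml.etree.ElementTree as ET

# Assumption: elements that should end a line in the text output.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "tr", "br"}


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix so XHTML-namespaced input also matches."""
    return tag.rsplit("}", 1)[-1]


def xhtml_to_text(source: str) -> str:
    """Flatten well-formed (X)HTML-like XML into plain text lines."""
    root = ET.fromstring(source)  # raises ET.ParseError on ill-formed input
    parts = []

    def walk(elem):
        if elem.text:
            parts.append(elem.text)
        for child in elem:
            walk(child)
            if local_name(child.tag) in BLOCK_TAGS:
                parts.append("\n")  # line break after a block-level element
            if child.tail:
                parts.append(child.tail)

    walk(root)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = [" ".join(line.split()) for line in "".join(parts).splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    doc = "<html><body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>"
    print(xhtml_to_text(doc))
```

The parser rejects anything that is not well-formed, which is exactly the
constraint that keeps the rest of the code this short.
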
Alternatively, if you truly have to accept arbitrary "HTML", you can look
at tools such as tag soup parsers (see e.g.
http://mercury.ccil.org/~cowan/XML/tagsoup/) that emit XML SAX parsing
events from HTML, HTML DOM implementations that can write out XML from
HTML, or an analogous tool; such a processor can be hooked into an XML
processing pipeline.
When it comes to writing out nice plain text output with XSLT (which is a
perfectly fine tool for the job), you may find multiple passes to be a good
way to proceed in any case.
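
As a rough illustration of the tolerant-parser idea for arbitrary input, the
sketch below uses the HTMLParser class from Python's standard library, which
reports start-tag, end-tag and text events without requiring well-formed
markup (loosely comparable to the SAX events a tag soup parser emits). This
is a stand-in, not TagSoup itself. The class TextExtractor, the function
tag_soup_to_text and the tag sets are hypothetical names chosen for the
example, and nested tables are simply flattened to one line per row.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text from parser events, tolerating unclosed or stray tags."""

    # Assumption: tags that should start or end a line in the output.
    BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "li", "tr", "table"}
    # Content of these elements is not meant for the reader.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def tag_soup_to_text(html: str) -> str:
    """Run the tolerant parser, then tidy whitespace in a second pass."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    lines = [" ".join(line.split()) for line in "".join(parser.parts).splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(tag_soup_to_text("<p>One<p>Two <b>bold<br>Three &amp; four"))
```

Keeping event collection and whitespace cleanup as two separate steps mirrors
the multiple-pass approach suggested above for XSLT.
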
Generally, XSLT can't be used on arbitrary HTML. A poor man's solution is
to use a tool like HTML Tidy to convert the HTML into well-formed XML that
XSLT can process, but I don't know if that could be adapted to your
requirement for "a platform independent way" (IIRC it is available compiled
for several platforms).
But if in general HTML-to-formatted-plain-text were easy, I think we'd see
lots more of it.
At 03:15 PM 6/21/2004, you wrote:
I am looking around for any tools to convert HTML to plain text in a
platform independent way. I also need support for UTF-8 encoding as well
as well-formatted output of nested tables. What is the best way to do
this? Is XSL-FO recommended for this? I looked around for any XSL to
convert HTML to FO, but I did not find any.
The HTML-to-text tools I found on the web are mostly Windows based. The
remaining ones are not very good at converting nested tables in HTML to a
properly rendered plain text format.
I appreciate any help.