xsl-list

RE: Converting HTML to plain text

2004-06-22 10:30:15
Radha,

Being able to constrain your incoming HTML will be a big help.

Given that, I think the easiest approach long-term will be a stylesheet, or a chain of stylesheets, that creates plain text from your HTML. I say "chain of stylesheets" because I think you'll find that some problems, like rendering those tables, are most tractable with multiple passes.

If it were me, I'd start with a two-pass approach. Leave aside XSL-FO: it specifies layout for formatted pages, and isn't designed to reflect the constraints of plain text. In my first pass, I'd render into plain text all information that has to be rendered inline, such as mapping <i>...</i> to *...* or whatever plain-text markup convention you decide on. A second pass (or passes) would take care of line breaks, indents and so forth.
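Here is a minimal sketch of that two-pass idea, driven from Java through the standard javax.xml.transform API. It makes several assumptions for illustration only: the class name TwoPassText is hypothetical, the input elements are in no namespace, the *...* convention is applied to <i> and <em> only, and the second pass does nothing more than normalize whitespace and put a blank line after each <p>. Real line wrapping and indenting would need more work.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical class name; a sketch of chaining two stylesheets, not a full converter.
final class TwoPassText {

    // Pass 1: identity transform, except that inline emphasis becomes *...* text.
    static final String PASS1 =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template match='i|em'>"
      + "<xsl:text>*</xsl:text><xsl:apply-templates/><xsl:text>*</xsl:text>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Pass 2: emit each paragraph as one whitespace-normalized line plus a blank line.
    static final String PASS2 =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'><xsl:apply-templates select='//p'/></xsl:template>"
      + "<xsl:template match='p'>"
      + "<xsl:value-of select='normalize-space(.)'/><xsl:text>&#10;&#10;</xsl:text>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String convert(String xhtml) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer pass1 = factory.newTransformer(new StreamSource(new StringReader(PASS1)));
            Transformer pass2 = factory.newTransformer(new StreamSource(new StringReader(PASS2)));

            // The result tree of pass 1 is held in memory and fed straight into pass 2.
            DOMResult intermediate = new DOMResult();
            pass1.transform(new StreamSource(new StringReader(xhtml)), intermediate);

            StringWriter out = new StringWriter();
            pass2.transform(new DOMSource(intermediate.getNode()), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><p>Some <i>very</i>   important\n text.</p>"
                     + "<p>Second paragraph.</p></body></html>";
        System.out.print(convert(xhtml));
    }
}
```

Run as is, the first paragraph should come out as "Some *very* important text." followed by a blank line. Because the intermediate tree is still XML, further passes (for lists, headings, or tables) can be slotted in between the two shown here without touching either stylesheet.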

Even so, I dare say you'll find that table layout is not trivial.

Other programmers have used Java to help with the trickier parts of this, and that is an option you should consider. An excellent article appears at http://www-106.ibm.com/developerworks/java/library/x-xmlist1/. Even if you can't use these exact tools, the architecture described is sound.
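To give a rough idea of what the Java side of the table problem might look like, here is a small sketch that lays cell text out as a grid by padding every cell to the width of the widest entry in its column. It is not taken from the article above. It assumes the cell text has already been extracted from the HTML (for example by an earlier pass), that there are no spanning cells, and that no cell needs to wrap; the class name TextTable and the " | " separator are hypothetical choices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: renders rows of plain cell text as an aligned grid.
final class TextTable {

    static String render(List<List<String>> rows) {
        // First, find the widest cell in each column.
        List<Integer> widths = new ArrayList<>();
        for (List<String> row : rows) {
            for (int c = 0; c < row.size(); c++) {
                int len = row.get(c).length();
                if (c == widths.size()) {
                    widths.add(len);          // first time this column is seen
                } else if (len > widths.get(c)) {
                    widths.set(c, len);
                }
            }
        }

        // Then emit each row, padding every column except the last.
        StringBuilder out = new StringBuilder();
        for (List<String> row : rows) {
            for (int c = 0; c < row.size(); c++) {
                String cell = row.get(c);
                if (c > 0) {
                    out.append(" | ");
                }
                out.append(cell);
                if (c < row.size() - 1) {
                    for (int pad = cell.length(); pad < widths.get(c); pad++) {
                        out.append(' ');
                    }
                }
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("Name", "Qty"),
            Arrays.asList("Apple", "3"),
            Arrays.asList("Kiwi", "12"));
        System.out.print(render(rows));
    }
}
```

The hard cases (cells that must wrap to a fixed width, rowspan and colspan, nested tables) are exactly where this simple approach stops, which is why I'd expect the table pass to take the most effort.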

Good luck,
Wendell

At 01:25 PM 6/22/2004, you wrote:
Hi Wendell,

I can constrain HTML pages to be valid XML. So, the hard part is solved.
But still I don't know of a good solution to convert it to plain text. I
tried XSL-FO with Apache FOP, using the IBM developerWorks XSL for
converting XHTML to FO
(http://www-106.ibm.com/developerworks/library/x-xslfo2app/). It does
proper conversion, but it has the following issues:
1) the formatting looks really bad. It has too much white space (most of
the words are separated by multiple space chars instead of 1).
2) If I change the font family, font size and line height as suggested
by the Apache FOP site, consecutive lines overwrite each other.
3) I had to specify the column width in pt. If by chance the column has
a word that does not fit into the given width, it is truncated instead
of wrapping.

Note: Some other tools that I have tried:
1) w3m does a good job. But it is C++ code and I cannot use it.
2) Red Hat has some Java classes, but their conversion is very primitive.
They don't format tables at all (each cell is rendered one after another
vertically instead of a grid-like rendering).

-- Radha


======================================================================
Wendell Piez                 mailto:wapiez(_at_)mulberrytech(_dot_)com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================