Being able to constrain your incoming HTML will be a big help.
Given that, I think the easiest approach long-term will be a stylesheet or
chain of stylesheets that create plain text from your HTML. The reason I
say "chain of stylesheets" is that I think you'll find some problems, like
rendering those tables, will be most tractable with multiple passes.
If it were me, I'd start with a two-pass approach. Leave aside XSL-FO: it
specifies layout for formatted pages, and isn't designed to reflect the
constraints of plain text. In my first pass, I'd render into plain text all
information that has to be rendered in line, such as mapping <i>...</i> to
*...* or whatever plain-text markup convention you decide on. A second pass
(or passes) would take care of line breaks, indents and so forth.
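To make the two-pass idea concrete, here is a minimal sketch. It is written in Python rather than as a stylesheet chain, purely to keep it short; the same split would apply to two stylesheets. The marker table, the function names, the sample input, and the assumption that the XHTML carries no namespace prefix are all mine, not anything from your setup.

```python
import textwrap
import xml.etree.ElementTree as ET

# Hypothetical plain-text conventions for inline elements (an assumption).
INLINE_MARKS = {"i": "*", "em": "*", "b": "**", "strong": "**"}


def inline_to_text(elem):
    """Pass 1 helper: flatten one element, turning inline markup into markers."""
    mark = INLINE_MARKS.get(elem.tag, "")
    parts = [mark, elem.text or ""]
    for child in elem:
        parts.append(inline_to_text(child))
        parts.append(child.tail or "")
    parts.append(mark)
    return "".join(parts)


def first_pass(xhtml):
    """Pass 1: return (tag, text) pairs for each block-level child of <body>."""
    body = ET.fromstring(xhtml).find("body")
    # Collapse runs of whitespace so words are separated by single spaces.
    return [(block.tag, " ".join(inline_to_text(block).split())) for block in body]


def second_pass(blocks, width=40):
    """Pass 2: apply line breaks and indents to the already-flattened text."""
    out = []
    for tag, text in blocks:
        indent = "    " if tag == "blockquote" else ""
        out.append(textwrap.fill(text, width=width,
                                 initial_indent=indent,
                                 subsequent_indent=indent))
    return "\n\n".join(out)


SAMPLE = ("<html><body><p>Tables are <i>not</i>   trivial.</p>"
          "<blockquote>Plan for <b>several</b> passes.</blockquote></body></html>")
result = second_pass(first_pass(SAMPLE))
print(result)
```

Because the first pass only decides what the inline text says, and the second only decides where lines break, either one can be changed without touching the other.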
Even so, I dare say you'll find that table layout is not trivial.
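One reason tables benefit from multiple passes is that a column's width is not known until every cell in it has been seen. The sketch below, again in Python and again only an illustration, measures first and pads second. The function names and sample table are hypothetical, and it deliberately ignores colspan, rowspan, and wrapping of long cells, which is where the real difficulty lies.

```python
import xml.etree.ElementTree as ET


def table_rows(table_xml):
    """Pass 1: pull the cell text out of each <tr>, collapsing whitespace."""
    table = ET.fromstring(table_xml)
    return [[" ".join("".join(cell.itertext()).split()) for cell in tr]
            for tr in table.iter("tr")]


def render_grid(rows):
    """Pass 2: size each column to its widest cell, then pad every cell."""
    ncols = max(len(row) for row in rows)
    rows = [row + [""] * (ncols - len(row)) for row in rows]  # fill short rows
    widths = [max(len(row[i]) for row in rows) for i in range(ncols)]
    lines = []
    for row in rows:
        padded = (cell.ljust(w) for cell, w in zip(row, widths))
        lines.append(" | ".join(padded).rstrip())
    return "\n".join(lines)


TABLE = ("<table><tr><th>Name</th><th>Qty</th></tr>"
         "<tr><td>Widget</td><td>12</td></tr>"
         "<tr><td>Bolt</td><td>7</td></tr></table>")
grid = render_grid(table_rows(TABLE))
print(grid)
```

Widths here are counted in characters and derived from the content, so no cell is cut off by a width fixed in advance.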
Other programmers have used Java to help with the trickier parts of this,
an option you should consider. An excellent article appears at
http://www-106.ibm.com/developerworks/java/library/x-xmlist1/. Even if you
can't use these exact tools, the architecture described is sound.
At 01:25 PM 6/22/2004, you wrote:
I can constrain HTML pages to be valid XML. So, the hard part is solved.
But still I don't know of a good solution to convert it to plain text. I
tried XSL FO with Apache FOP using IBM Developerworks XSL for converting
xhtml to fo
(http://www-106.ibm.com/developerworks/library/x-xslfo2app/). It does
proper conversion, but it has the following issues:
1) the formatting looks really bad. It has too much white space (most of
the words are separated by multiple space chars instead of 1).
2) If I change the font family, font size and line height as suggested
by the Apache FOP site, consecutive lines overwrite each other.
3) I had to specify the column width in pt. If by chance the column has
a word that does not fit into the given width, it is truncated instead
Note: Some others that I have tried.
1) w3m does a good job. But it is C++ code and I cannot use it.
2) Redhat has some java classes, but their conversion is very primitive.
They don't format tables at all (each cell is rendered one after another
vertically instead of a grid-like rendering).
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
Mulberry Technologies: A Consultancy Specializing in SGML and XML