http://www.w3.org/People/Raggett/tidy/
Have you tried using tidy?
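For what it's worth, here is a rough sketch of how tidy might be run over a Word export before handing the result to an XSLT processor. It assumes the tidy command-line program is installed and on the PATH. The function names and the file names in the usage note are made up for illustration. The options used (-quiet, -asxhtml, -numeric, --word-2000, -output) are ones tidy documents, but check them against the version you have installed.

```python
import shutil
import subprocess


def build_tidy_command(source, target):
    """Build the argument list for a tidy run (hypothetical helper).

    `source` is the Word-generated HTML file, `target` is the XHTML
    file tidy should write. Both names are supplied by the caller.
    """
    return [
        "tidy",
        "-quiet",               # suppress non-essential output
        "-asxhtml",             # convert the input to well-formed XHTML
        "-numeric",             # write numeric rather than named entities
        "--word-2000", "yes",   # strip Word 2000 specific markup
        "-output", target,      # write the result to this file
        source,
    ]


def tidy_word_html(source, target):
    """Run tidy and return its exit status.

    tidy exits with 0 when it found nothing to report, 1 when it
    issued warnings, and 2 when it hit errors.
    """
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy was not found on the PATH")
    result = subprocess.run(build_tidy_command(source, target), check=False)
    return result.returncode
```

Calling tidy_word_html("word-export.html", "cleaned.xhtml") should leave a file that an XML parser accepts, provided tidy reports a status of 0 or 1. A status of 2 means tidy could not repair the input.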
-----Original Message-----
From: Ted Stresen-Reuter [mailto:tedmasterweb(_at_)mac(_dot_)com]
Sent: Tuesday, March 11, 2003 4:37 PM
To: xsl-List(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: [xsl] How to Handle Bad XML (or Word HTML)
Hi,
Thanks again to everyone who answers on this list. You've all been
really sweet.
Today's question is about transforming the HTML produced by MS Word
into valid XHTML.
In general, the problem is that Word doesn't produce well-formed XML
(specifically, many elements have unquoted attribute values). The file
I'm working with starts with the following:
<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">
Additionally, a typical element might look like this:
<p class=MsoNormal style='text-align:justify;mso-hyphenate:none'><![if
!supportEmptyParas]> <![endif]><o:p></o:p></p>
Is it even possible to use such a document as a source document and, if
so, how do I handle the errors returned by the XSLT processor when it
finds unquoted attributes?
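To make the failure concrete, here is a minimal sketch in Python using only the standard library. The fragment is simplified from the paragraph element quoted above, and is_well_formed is a hypothetical helper name. An XSLT processor reads its source through an XML parser, and an unquoted attribute value is a fatal well-formedness error, so the parse stops before any template runs. The stylesheet itself never gets a chance to handle it.

```python
import xml.etree.ElementTree as ET

# Simplified from the Word output above: the class attribute is unquoted.
WORD_FRAGMENT = "<p class=MsoNormal style='text-align:justify'>text</p>"

# The same element with every attribute value quoted.
QUOTED_FRAGMENT = '<p class="MsoNormal" style="text-align:justify">text</p>'


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(WORD_FRAGMENT))
print(is_well_formed(QUOTED_FRAGMENT))
```

The first fragment is rejected and the second is accepted, which is why the source has to be repaired before the transformation rather than during it.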
Thanks again to all of you who take the time to read and actually
answer these queries.
Sincerely,
Ted Stresen-Reuter
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list