xsl-list

RE: How to Handle Bad XML (or Word HTML)

2003-03-11 14:57:24
The best bet is to use HTML Tidy to tidy it up:
http://tidy.sourceforge.net

Tidy even has a mode specifically for MS Word.
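
A rough sketch of driving Tidy from a script, in case it helps. It assumes the `tidy` executable is on your PATH. The helper names and file names are made up for illustration, and `word-2000` is the Word cleanup option as I understand it from the Tidy documentation, so check it against your Tidy version.

```python
import shutil
import subprocess


def build_tidy_command(src, dest):
    """Return the argument list for a Tidy run that writes XHTML to dest."""
    return [
        "tidy",
        "-quiet",              # suppress the banner and summary
        "-asxhtml",            # emit well-formed XHTML
        "-numeric",            # numeric entities, so &nbsp; is written as &#160;
        "--word-2000", "yes",  # strip Word-specific markup (assumed option name)
        "-output", dest,
        src,
    ]


def tidy_word_html(src, dest):
    """Run Tidy on src; raise if Tidy is missing or reports errors."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy executable not found on PATH")
    result = subprocess.run(
        build_tidy_command(src, dest),
        stderr=subprocess.PIPE,
        text=True,
    )
    # Tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.returncode
```

Calling `tidy_word_html("word.html", "clean.xhtml")` (hypothetical file names) should leave a file that an XML parser, and therefore an XSLT processor, can read.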

Also note that Word in Office 11 (currently in Beta 2) supports
round-tripping of documents as well-formed XML.

-----Original Message-----
From: Ted Stresen-Reuter [mailto:tedmasterweb(_at_)mac(_dot_)com]
Sent: Tuesday, March 11, 2003 1:37 PM
To: xsl-List(_at_)lists(_dot_)mulberrytech(_dot_)com

Hi,

Thanks again to everyone who answers on this list. You've all been
really sweet.

Today's question is about transforming the HTML produced by MS Word
into valid XHTML.

In general, the problem is Word doesn't produce "valid" XML
(specifically, for many elements, attributes are not quoted). The file
I'm working with starts with the following:

<html xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

Additionally, a typical element might look like this:

<p class=MsoNormal style='text-align:justify;mso-hyphenate:none'><![if
!supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>

Is it even possible to use such a document as a source document and, if
so, how do I handle errors returned by the XSLT processor when unquoted
attributes are found?
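
A minimal sketch of why these errors cannot be handled inside a stylesheet: the XML parser rejects the input before any template runs. The fragments below are shortened, made-up variants of the element shown above, and Python's standard-library parser stands in for whatever parser the XSLT processor uses.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative fragments (not the exact Word output).
WORD_FRAGMENT = "<p class=MsoNormal style='text-align:justify'>text</p>"
QUOTED_FRAGMENT = "<p class='MsoNormal' style='text-align:justify'>text</p>"


def is_well_formed(markup):
    """Return True if markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(WORD_FRAGMENT))    # unquoted attribute value
print(is_well_formed(QUOTED_FRAGMENT))  # same element with quotes added
```

The first call reports a failure and the second succeeds, so the input has to be repaired before it reaches the processor.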

Thanks again to all of you who take the time to read and actually
answer these queries.

Sincerely,

Ted Stresen-Reuter


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list





