With this input:
<p>Some <i>stuff</i>
that should be cleaned.<br/>
More <b>stuff.</b>
<p>
Yet more.<br>
</p>
Stuff.
</p>
I get this XML output, which you can then clean up with XSLT:
<sample>
<p>Some <emphasis>stuff</emphasis> that should be cleaned.</p>
<paragraph>More <strong>stuff.</strong></paragraph>
<p>Yet more.</p>
<paragraph>Stuff.</paragraph>
</sample>
Using this XML control file:
<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE convert2xml SYSTEM "c:\d\xml\convert2xml.dtd" >
<!--
file: HTML-cleanup.ctl
Purpose: Control file for c2x program
Author: jaa
Date: 20020124
Clean up dirty HTML and make it into good XML
-->
<convert2xml>
<root-element name="sample">
</root-element>
<recognize-element name="paragraph">
<start-token>
<pattern>\pp</pattern>
<before>
</before>
</start-token>
<end-token>
<pattern>
</p></pattern>
</end-token>
<allowed-child ref="emphasis"/>
<allowed-child ref="strong"/>
</recognize-element>
<recognize-element name="p">
<start-token>
<pattern><p>
</pattern>
<before>
</before>
</start-token>
<start-token>
<pattern><p></pattern>
<before>
</before>
</start-token>
<end-token>
<pattern></p></pattern>
</end-token>
<end-token>
<pattern><b>
</p></pattern>
</end-token>
<end-token>
<pattern><br/>
</pattern>
<parsed-after>\pp</parsed-after>
</end-token>
<end-token>
<pattern><br/>
</p></pattern>
<parsed-after>\pp</parsed-after>
</end-token>
<end-token>
<pattern><br>
</p>
</pattern>
<parsed-after>\pp</parsed-after>
</end-token>
<end-token>
<pattern><br/></pattern>
<parsed-after>\pp</parsed-after>
</end-token>
<end-token>
<pattern><br></pattern>
</end-token>
<end-token>
<pattern>
</p></pattern>
</end-token>
<allowed-child ref="emphasis"/>
<allowed-child ref="strong"/>
</recognize-element>
<recognize-element name="emphasis">
<start-token>
<pattern><i></pattern>
</start-token>
<end-token>
<pattern></i></pattern>
</end-token>
<end-token>
<pattern></i>
</pattern>
<after> </after>
</end-token>
</recognize-element>
<recognize-element name="strong">
<start-token>
<pattern><b></pattern>
</start-token>
<end-token>
<pattern></b></pattern>
</end-token>
<end-token>
<pattern></b>
</pattern>
</end-token>
</recognize-element>
</convert2xml>
The conversion is done by a free program called C2X (convert to XML).
Ask me off-list if you want more information, since C2X is off-topic here.
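For the XSLT cleanup step mentioned above, here is a minimal sketch of what such a stylesheet might look like. It is not part of C2X. It assumes the goal is the markup asked for in the original question, so it maps paragraph to p, emphasis to i and strong to b, and copies everything else unchanged. That mapping is my assumption, as is keeping the sample root element.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml"/>

  <!-- Identity template: copy any node that no other template matches. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- The C2X output mixes p and paragraph; fold both into p (assumed target). -->
  <xsl:template match="paragraph">
    <p><xsl:apply-templates select="@*|node()"/></p>
  </xsl:template>

  <!-- Map the semantic names back to the HTML names used in the question (assumed). -->
  <xsl:template match="emphasis">
    <i><xsl:apply-templates select="@*|node()"/></i>
  </xsl:template>

  <xsl:template match="strong">
    <b><xsl:apply-templates select="@*|node()"/></b>
  </xsl:template>

</xsl:stylesheet>
```

Run against the sample output above with any XSLT 1.0 processor, this should give four p elements inside the sample wrapper. If the wrapper is not wanted, a further template matching sample that only applies templates to its children would drop it.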
Date: Thu, 23 Jan 2003 21:54:43 +0100
From: Ole Sandum <osandum(_at_)bigfoot(_dot_)com>
Subject: [xsl] cleaning up ill-structured html
Example:
<p>Some <i>stuff</i>
that should be cleaned.<br/>
More <b>stuff.</b>
<p>
Yet more.<br>
</p>
Stuff.
</p>
Should become:
<p>Some <i>stuff</i> that should be cleaned.</p>
<p>More <b>stuff.</b></p>
<p>Yet more.</p>
<p>Stuff.</p>
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list