I dropped to a two-stage process.
1. Java to read the feed, check for well-formedness only, and save it
to disk. That kills non-well-formed content.
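
A minimal sketch of what stage 1 could look like, assuming the JDK's
built-in SAX parser is good enough for the job. The class and method
names (WellFormedGate, fetch, isWellFormed, saveIfWellFormed) are made
up for illustration; they are not the code I actually run.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: read a feed, keep it only if it is well-formed XML.
final class WellFormedGate {

    // Read the raw bytes of a feed from a URL (no caching, no retries).
    static byte[] fetch(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return in.readAllBytes();
        }
    }

    // True if the bytes parse as well-formed XML. No validation is done.
    static boolean isWellFormed(byte[] xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false);
            // DefaultHandler ignores content and rethrows fatal errors.
            factory.newSAXParser()
                   .parse(new ByteArrayInputStream(xml), new DefaultHandler());
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    // Write the feed to disk only when it is well-formed.
    static boolean saveIfWellFormed(byte[] xml, Path target) throws IOException {
        if (!isWellFormed(xml)) {
            return false; // non-well-formed content dies here
        }
        Files.write(target, xml);
        return true;
    }
}
```

Usage would be something like
WellFormedGate.saveIfWellFormed(WellFormedGate.fetch(url), path),
skipping the feed whenever it returns false.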
2. RSS documents often contain escaped markup hidden inside
Not in CDATA sections either. See Tim Bray's blog.
I tend to do the following:
If the feed is well-formed, get the escaped HTML as a string and tidy it.
If the tidied HTML is well-formed, save its body back into the same
element it came from; if it is not well-formed, strip the HTML tags and
save the resulting string to that element instead.
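
Roughly, in DOM terms, and only as a sketch: the tidy pass is left out
to keep this dependency-free, the <wrapper> element and the class and
method names are made up, and storing the well-formed HTML as real
child nodes is one reading of "save the body back into the element".
The tag stripping is a crude regex, not a real HTML parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for the escaped-markup step.
final class EscapedMarkupFixer {

    static DocumentBuilder newBuilder() throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        DocumentBuilder b = f.newDocumentBuilder();
        // Rethrow fatal errors instead of printing them to stderr.
        b.setErrorHandler(new DefaultHandler());
        return b;
    }

    // Replace the escaped HTML inside el with real markup if it parses,
    // otherwise with the plain text left after stripping the tags.
    static void unescapeInPlace(Element el) throws Exception {
        // The XML parser has already turned &lt;p&gt; back into <p>.
        String html = el.getTextContent();
        // A tidy pass over html would go here (omitted in this sketch).
        Document fragment;
        try {
            fragment = newBuilder().parse(new InputSource(
                    new StringReader("<wrapper>" + html + "</wrapper>")));
        } catch (SAXException notWellFormed) {
            // Fallback: drop anything that looks like a tag, keep the text.
            el.setTextContent(html.replaceAll("<[^>]*>", ""));
            return;
        }
        el.setTextContent(""); // removes the old escaped text
        Node child = fragment.getDocumentElement().getFirstChild();
        while (child != null) {
            // importNode copies, so the fragment's sibling chain stays intact.
            el.appendChild(el.getOwnerDocument().importNode(child, true));
            child = child.getNextSibling();
        }
    }
}
```

Note that HTML entities such as &nbsp; are undefined in plain XML, so
content using them falls through to the strip-tags branch unless the
tidy step has converted them first.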
Send the whole thing to an intermediary XSLT which can output a single
newsfeed format, as my architecture handles variant RSS flavors, as well
as other
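
To give an idea of the intermediary step, here is a toy version driven
from Java through javax.xml.transform. The target vocabulary (feed,
entry, title, link) and the FeedNormalizer class are invented for the
example; my real single format is not shown here. Matching on
local-name() is just one way to cope with both namespaced and
non-namespaced RSS flavors.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical normalizer: any RSS flavor in, one made-up format out.
final class FeedNormalizer {

    // Every element named "item", in any namespace, becomes an <entry>.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<feed><xsl:apply-templates select=\"//*[local-name()='item']\"/></feed>"
        + "</xsl:template>"
        + "<xsl:template match='*'>"
        + "<entry>"
        + "<title><xsl:value-of select=\"*[local-name()='title']\"/></title>"
        + "<link><xsl:value-of select=\"*[local-name()='link']\"/></link>"
        + "</entry>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String normalize(String feedXml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(feedXml)),
                    new StreamResult(out));
        return out.toString();
    }
}
```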
The whole RSS thing really pisses me off because of the FUD that using
escaped markup is in some way a sound design decision, supposedly
because it makes things easier for users.
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list