The purpose of using XML, or of using a standard at all, is that you
know that supplier and receiver understand the format and that you need
not worry about vendor-specific formats or deviations. XML is a very
free language, but the standard does dictate that when a document is
not well-formed (and an encoding problem means it isn't), a
processor *must* reject it with a fatal error. If you try to bypass that,
it is like driving a car with no brakes: some day you will hit a wall
and things will crash, and all the while you thought you were driving a
real car... it at least looked like one ;)
If you cannot fix the source (i.e., some proprietary, legacy, home-grown
XML-like format which you have to deal with regardless of what a standard
dictates), it is best to agree with your source on what exactly the
differences are (or can be), and to pin that down as strictly as
you can. Then decide how to deal with it. Ideally, in your situation,
I'd choose a single filter or a filter chain. Many existing workflow
systems have that, and if yours doesn't, it's trivial to write one (but
don't use XSLT for it, because that expects XML, which you haven't got yet).
After you have filtered it and transformed the wannabe XML into proper XML,
you can start transforming it with XSLT. Without any hassle, really.
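
A filter of this kind could look roughly like the sketch below. It is
Python rather than anything .NET-specific, and the function name
strip_illegal_xml_chars is made up for illustration. It assumes the only
thing wrong with the input is characters outside the XML 1.0 Char
production, and that replacing them with a space is acceptable.

```python
import re
import xml.etree.ElementTree as ET

# Characters XML 1.0 does not allow anywhere in a document: the C0
# controls other than tab, LF and CR, the surrogate range, and
# U+FFFE / U+FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_illegal_xml_chars(text, replacement=" "):
    """Return text with every XML-1.0-illegal character replaced."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


# Hypothetical input containing a backspace (U+0008) in the text content.
raw = "<doc>bad\x08char</doc>"
clean = strip_illegal_xml_chars(raw)

# The cleaned text is ordinary XML and can go to any conforming parser.
root = ET.fromstring(clean)
```

Run this before anything XML-aware touches the data; after that, the
XSLT step sees ordinary well-formed XML.
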
There's only one other option I can think of, which will basically come down
to the same thing in the end but may be more extensible: write an
encoding parser, call it "almost-utf8", register it, and set the
encoding of your document to this home-grown encoding (<?xml
version="1.0" encoding="almost-utf8"?>). The encoding is identical to
UTF-8 except for the characters that you don't allow, which
you map to a space or whatever.
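
To give an idea of what such an encoding amounts to, here is a rough
Python sketch using the standard codecs registry. The helper names are
invented for illustration, and only the C0 control characters are mapped.
Whether your XML processor lets you plug in a custom encoding at all
depends on the processor; this sketch only registers the codec with
Python itself, so you would decode the bytes first and hand the
resulting text to the parser.

```python
import codecs
import re

# C0 control characters that XML 1.0 forbids (tab, LF and CR are allowed).
_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def _almost_utf8_decode(data, errors="strict"):
    # Decode as plain UTF-8, then map the forbidden characters to a space.
    text, consumed = codecs.utf_8_decode(data, errors, True)
    return _ILLEGAL.sub(" ", text), consumed


def _almost_utf8_encode(text, errors="strict"):
    # Encoding is identical to UTF-8.
    return codecs.utf_8_encode(text, errors)


def _search(name):
    # Depending on the Python version, the registry passes the name with
    # or without the hyphen normalised to an underscore, so accept both.
    if name in ("almost-utf8", "almost_utf8"):
        return codecs.CodecInfo(
            name="almost-utf8",
            encode=_almost_utf8_encode,
            decode=_almost_utf8_decode,
        )
    return None


codecs.register(_search)

# Hypothetical input: UTF-8 bytes with a stray backspace in them.
decoded = b"<doc>bad\x08char</doc>".decode("almost-utf8")
```

In the end this is the same filter as before, just moved into the
decoding step.
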
But all these methods are far from perfect compared to fixing it at the
source. What is the use of a BS (backspace) character in your
document anyway?
Cheers,
-- Abel Braaksma
Waqar Ali wrote:
Sorry, I do not want to drag this topic out, but setting CheckCharacters to
false does not work. Here is what is written in the documentation:
"If the XmlReader is processing text data, it always checks that the
XML names and text content are valid, regardless of the property
setting. Setting CheckCharacters to false turns off character checking
for character entity references."
No matter what I do, the parser does not like this character, and I have no
option but to somehow take it out of the XML.
Thanks guys for your help.
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list