RE: [xsl] unparsed-text() and illegal characters

The spec is very strict that characters not allowed in XML cause an error.
This is a change since the book was written.

However, the spec is very loose about how URIs are resolved. So a conformant
product could take the URI

thing.txt?substitute-illegal-chars=FFFD

as a reference to "the document formed by taking thing.txt and substituting
illegal characters with xFFFD."

Perhaps I'll do that.
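
Until something like that exists, the same substitution can be done in a
pre-filter outside the stylesheet. What follows is only a rough sketch in
Python, not anything Saxon provides: the names substitute_illegal_chars and
read_text_for_xml are made up for illustration, and the ISO-8859-1 default
is an assumption taken from the description of the file in the original
question. The character ranges are the ones excluded by the Char production
of XML 1.0.

```python
import re

# Code points that the XML 1.0 Char production does not allow:
# C0 controls other than tab (x09), LF (x0A) and CR (x0D),
# the surrogate block, and the noncharacters xFFFE and xFFFF.
_ILLEGAL_XML10 = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def substitute_illegal_chars(text, replacement="\ufffd"):
    """Return text with every XML 1.0-illegal character replaced."""
    return _ILLEGAL_XML10.sub(replacement, text)


def read_text_for_xml(path, encoding="iso-8859-1"):
    """Hypothetical helper: decode a file and clean it for use in XML.

    The default encoding is an assumption; pass the real one if known.
    newline="" keeps the original line endings untouched.
    """
    with open(path, encoding=encoding, newline="") as f:
        return substitute_illegal_chars(f.read())
```

The cleaned string could then be written back out (for example as UTF-8) to
a file that unparsed-text() reads without raising the error. Passing
replacement="" instead would drop the offending characters rather than mark
them with xFFFD.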

Michael Kay
http://www.saxonica.com/

-----Original Message-----
From: Abel Braaksma Online [mailto:abel(_dot_)online(_at_)xs4all(_dot_)nl] 
Sent: 27 July 2006 19:10
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: [xsl] unparsed-text() and illegal characters

Dear List,

Trying to "import" a non-XML file of an undefined encoding, I 
received the following error when using Saxon8: "The unparsed 
text file contains a character illegal in XML (line=1 
column=4 value=hex 11)". I only found one reference about 
this error 
(http://www.stylusstudio.com/xsllist/200510/post90470.html), 
which is actually a post about illegal characters inside the 
XSLT document.

Michael Kay points out in that post that this error is merged 
into XTDE1190 (see 
http://www.w3.org/TR/xslt20/#err-XTDE1190). It is claimed in 
the specs that non-understood characters or byte sequences 
should result in this non-recoverable dynamic error.

In his indispensable book, the XSLT 2.0 Programmer's 
Reference, he states the following:
"Some processors will provide configuration options that pass 
this choice on the user. If the file contains characters that 
are invalid in XML (this applies to most control characters 
in the range x00 to x1F under XML 1.0, but only to the null 
character x00 under XML 1.1) then the invalid characters are 
substituted by the special Unicode character xFFFD, which is 
specifically intended for such purposes."

I understand that the book was written before XSLT 2.0 was 
finalized (it is still a Candidate Recommendation), but I wonder 
if a treatment like the above is still possible somehow. The contents 
of the file are ISO-8859-1, apart from the start and end 
header, which contain control characters. I only need the 
part that is parsable as text, the rest can be dismissed.

Am I asking too much from XSLT, or is this somehow possible? 
It would really add to the possibilities, and it means I 
don't need some extra filter or preparse step.

Cheers,
Abel Braaksma
www.nuntia.nl

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


