RE: [xsl] Tranformation failed with Saxon for "Illegal HTML character"

Hi,

According to Alan Wood's excellent resource, the Microsoft character set in 
question is the "ANSI" character set, also known as Windows-1252:


  http://www.alanwood.net/demos/charsetdiffs.html#a
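
To make the mismatch concrete, here is a small sketch in Python (my choice for 
illustration only; it is not part of the Saxon setup discussed in this thread). 
It shows that byte 128 is the Euro sign in Windows-1252, that the Euro sign in 
Unicode is U+20AC (decimal 8364), and that Unicode character 128 is a control 
character:

```python
import unicodedata

raw = bytes([128])  # the single byte 0x80

# Decoded as Windows-1252 (Python codec name "cp1252"), byte 128 is the Euro sign.
euro = raw.decode("cp1252")
euro_code_point = ord(euro)          # the Unicode code point of the Euro sign
euro_name = unicodedata.name(euro)   # its official Unicode character name

# Unicode code point 128 (U+0080) is a C1 control character, not the Euro sign.
c1_category = unicodedata.category(chr(128))

# In UTF-8 the Euro sign is a three-byte sequence, never a lone 0x80 byte.
euro_utf8 = euro.encode("utf-8")

print(euro_code_point, euro_name, c1_category, euro_utf8)
```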

From my publishing experience with external parties supplying XML documents, I 
can add a few things I've learned (the hard way) regarding misencoded files 
and "illegal" characters:


1. An XML file's encoding declaration is only a HINT at what is actually in the 
file. The declaration by itself does not guarantee the encoding, nor does it 
force any re-encoding of the file. It is all too easy to make a simple edit to 
a correctly encoded, correctly declared UTF-8 XML file and then mistakenly save 
it using the editor's default encoding of, say, ISO-8859-1 or Windows-1252. The 
result: a corrupted file containing byte sequences that are "illegal" in UTF-8, 
even though the declaration still says encoding="UTF-8". Pass this file off to 
another process and, if undetected, the illegal characters propagate to other 
systems, possibly compounding the problem.

2. Fortunately, most XML editors are aware of such encoding issues and will do 
the right thing when the file is saved. Unfortunately, from what I see, many 
text editors, server processes, database and email character set settings, and 
browsers all fall back to a default encoding of ISO-8859-1. So one must be very 
careful that every step in the process agrees on the encoding: from file 
creation and editing, to local storage on a file system, to server processes, 
to database storage, to further server and email processes, to viewing in a 
browser.

3. Because of these hard lessons, I firmly believe that "... the sooner you 
catch misencoded files (or files whose encoding is misdeclared), the better it 
is for the user in the long run."

4. Just "Encode Your Documents in UTF-8" (Elliotte Harold):

  http://www-128.ibm.com/developerworks/xml/library/x-utf8/index.html
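
As a rough sketch of the early check argued for in points 1 and 3, the 
following Python snippet decodes a file's bytes strictly as UTF-8 and reports 
the first invalid sequence. The function name find_invalid_utf8 and the sample 
document are made up for illustration. Note the limit of the check: a failure 
proves the file is not UTF-8, but a pass does not prove the file was saved as 
UTF-8 (pure ASCII content passes under either encoding).

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        return err.start, data[err.start:err.end]
    return None


# A hypothetical document that declares UTF-8 and contains a Euro sign (U+20AC).
document = '<?xml version="1.0" encoding="UTF-8"?>\n<price>\u20ac10</price>\n'

correct = document.encode("utf-8")      # saved as declared
misencoded = document.encode("cp1252")  # saved with a Windows-1252 editor default

print(find_invalid_utf8(correct))     # no invalid sequence found
print(find_invalid_utf8(misencoded))  # offset and bytes of the stray 0x80
```

Running a check like this at the point where files arrive from external 
parties catches the problem before it propagates to later steps.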

Regards,
Mike Waters
Springer
Content Management | Content Technologies
233 Spring St. | New York, NY 10013 | USA
mike(_dot_)waters(_at_)springer(_dot_)com
www.springeronline.com
www.springerlink.com

-----Original Message-----
From: Michael Kay [mailto:mike(_at_)saxonica(_dot_)com]
Sent: Friday, July 28, 2006 5:41 PM
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Cc: desantis(_at_)objectway(_dot_)it
Subject: RE: [xsl] Tranformation failed with Saxon for "Illegal HTML
character"

The Euro symbol is not decimal 128 in Unicode. It is decimal 
128 in some Microsoft character set whose name I have 
forgotten. The Unicode character 128 is not a legal HTML character.

You need to make sure that the character encoding of the XML 
file is correctly declared: if you are using a particular 
Microsoft codepage, then you need to say so in the XML declaration.

There was a significant controversy in W3C about the rule that 
invalid HTML characters must be treated as a fatal error by 
XSLT processors. I argued for leniency, but the view that 
prevailed was that the sooner you catch misencoded files (or 
files whose encoding is misdeclared), the better it is for the 
user in the long run. 

Michael Kay
http://www.saxonica.com/