Hi,
According to Alan Wood's excellent resource, the character set in question is
the "ANSI" character set, also known as Windows-1252:
http://www.alanwood.net/demos/charsetdiffs.html#a
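To make the difference concrete, here is a small Python sketch (my own illustration, not taken from Alan Wood's page) showing what byte 128 (0x80) means in each character set. The variable names are arbitrary; the codec names "cp1252" and "latin-1" are Python's standard names for Windows-1252 and ISO-8859-1.

```python
import unicodedata

# Byte 0x80 (decimal 128) is the Euro sign in Windows-1252 ...
euro = b"\x80".decode("cp1252")
euro_codepoint = ord(euro)            # 0x20AC, i.e. U+20AC EURO SIGN

# ... but in ISO-8859-1 the same byte maps to U+0080, a C1 control character.
c1 = b"\x80".decode("latin-1")
c1_category = unicodedata.category(c1)  # "Cc" = control character

# In UTF-8 the Euro sign is a three-byte sequence, not the single byte 0x80.
euro_utf8 = "\u20ac".encode("utf-8")

print(hex(euro_codepoint), repr(c1), c1_category, euro_utf8)
```

So a file that really is Windows-1252 but gets read as ISO-8859-1 ends up with character 128 where a Euro sign was intended.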
From my publishing experience with external parties supplying XML documents, I
can add a few things I've learned (the hard way) regarding misencoded files
and "illegal" characters:
1. An XML file's encoding declaration is only a HINT at what is actually in the
file. The declaration, in itself, does not guarantee the encoding, nor does it
force any re-encoding of the file in any way whatsoever. It is all too easy to
make a simple edit to a correctly encoded, correctly declared UTF-8 XML file,
and mistakenly save the file using the editor's default encoding of, say,
ISO-8859-1 or Windows-1252. The result: a corrupted file containing byte
sequences that are "illegal" in UTF-8, even though the declaration still says
encoding="UTF-8". Pass this file off to another process and, if the problem
goes undetected, the illegal bytes propagate to other systems, possibly
compounding the problem.
2. Fortunately, most XML editors are aware of such encoding issues and will do
the right thing when the file is saved. Unfortunately, from what I see, many
text editors, server processes, databases, email systems, and browsers fall
back to a default encoding of ISO-8859-1. So,
one must be very careful that every step in the process agrees on the encoding:
from file creation to editing, to local storage on a file system, to server
processes, to database storage, to more server and email processes, to viewing
in a browser.
3. Because of all these hard-learned lessons, I firmly believe that "... the
sooner you catch misencoded files (or files whose encoding is misdeclared),
the better it is for the user in the long run."
4. Just "Encode Your Documents in UTF-8" (Elliotte Harold):
http://www-128.ibm.com/developerworks/xml/library/x-utf8/index.html
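As a rough illustration of points 1 and 3, the following Python sketch checks whether a file's bytes can actually be decoded with the encoding its XML declaration claims. The names declared_encoding and matches_declaration are hypothetical helpers of my own, and the sketch assumes an ASCII-compatible encoding (it does not handle UTF-16 or byte order marks in front of the declaration).

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
DECL_RE = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default if none."""
    m = DECL_RE.match(data)
    return m.group(1).decode("ascii") if m else default

def matches_declaration(data: bytes) -> bool:
    """True if the bytes decode cleanly using the declared encoding."""
    try:
        data.decode(declared_encoding(data))  # strict decoding raises on bad bytes
    except (UnicodeDecodeError, LookupError):  # LookupError: unknown encoding name
        return False
    return True

doc = '<?xml version="1.0" encoding="UTF-8"?><price>\u20ac10</price>'
good = doc.encode("utf-8")    # saved correctly
bad = doc.encode("cp1252")    # saved with the editor's Windows-1252 default

print(matches_declaration(good), matches_declaration(bad))
```

Note that this only catches the case where the declared encoding rejects some byte sequences, as UTF-8 does. A single-byte encoding such as ISO-8859-1 accepts any byte, so a file misdeclared as ISO-8859-1 would pass this check unnoticed.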
Regards,
Mike Waters
Springer
Content Management | Content Technologies
233 Spring St. | New York, NY 10013 | USA
mike(_dot_)waters(_at_)springer(_dot_)com
www.springeronline.com
www.springerlink.com
-----Original Message-----
From: Michael Kay [mailto:mike(_at_)saxonica(_dot_)com]
Sent: Friday, July 28, 2006 5:41 PM
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Cc: desantis(_at_)objectway(_dot_)it
Subject: RE: [xsl] Tranformation failed with Saxon for "Illegal HTML
character"
The Euro symbol is not decimal 128 in Unicode. It is decimal
128 in some Microsoft character set whose name I have
forgotten. The Unicode character 128 is not a legal HTML character.
You need to make sure that the character encoding of the XML
file is correctly declared: if you are using a particular
Microsoft codepage, then you need to say so in the XML declaration.
There was a significant controversy in W3C about the rule that
invalid HTML characters must be treated as a fatal error by
XSLT processors. I argued for leniency, but the view that
prevailed was that the sooner you catch misencoded files (or
files whose encoding is misdeclared), the better it is for the
user in the long run.
Michael Kay
http://www.saxonica.com/