Brian Grainger wrote:
If you're trying to escape in a document encoded as
UTF-8, you have to use Unicode escaping of the UTF-8
representation of the entity. In this case, &nbsp; is equal to
&#160;, and &#160; encoded as UTF-8 is \u00A0.
Good grief. No, you have your terminology badly mixed up, and you're throwing
in an irrelevant notation. "&nbsp;", "&#160;" and "\u00A0" have nothing,
NOTHING to do with UTF-8. There is something about nbsp that just confuses the
heck out of people. I think it must be the fact that it looks like a space,
and that you don't have an nbsp key on your keyboard.
OK, read this.
1. There is a character -- an abstract unit in a "script" (a writing system;
we are using Latin right now) -- called NO-BREAK SPACE by the Unicode Standard
and ISO/IEC 10646. Unicode and ISO/IEC 10646 assign this character an integer
number, 160, which is A0 in hex. We say Unicode all the time around here, but
we mean ISO/IEC 10646 because that's what the XML and HTML specs reference.
The two standards share the same character repertoire and numbering so there's
no harm.
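A quick way to check the number for yourself, as a minimal sketch in Python (the variable name is just for illustration; unicodedata is the standard library's view of the Unicode character database):

```python
import unicodedata

nbsp = chr(160)  # the character that Unicode numbers 160

print(hex(ord(nbsp)))          # 0xa0
print(unicodedata.name(nbsp))  # NO-BREAK SPACE
print(nbsp == " ")             # False: the ordinary space is number 32
```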
2. UTF-8 is an encoding scheme that provides a way of representing any of the
approximately 1.1 million possible character numbers in Unicode as a
sequence of 1 to 4 bytes. The UTF-8 representation of Unicode character
160 (no-break space) is the pair of bytes C2 A0, in that order. In contrast,
iso-8859-1 is a character map that provides a way of representing each of the
first 256 Unicode characters as a single byte. us-ascii is an even more
limited set of just the first 128, each mapped to a single byte.
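The byte sequences can be checked with a short Python sketch. The variable names are mine, and the codec names are the ones Python's standard library happens to accept:

```python
nbsp = chr(160)  # NO-BREAK SPACE

utf8_bytes = nbsp.encode("utf-8")
latin1_bytes = nbsp.encode("iso-8859-1")
print(utf8_bytes.hex())    # c2a0
print(latin1_bytes.hex())  # a0

# us-ascii stops at 127, so character 160 cannot be encoded at all.
try:
    nbsp.encode("us-ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)  # False

# Reading the UTF-8 bytes as if they were iso-8859-1 yields two
# characters (numbers 194 and 160) instead of one.
misread = utf8_bytes.decode("iso-8859-1")
print(len(misread))  # 2
```

Same character, different bytes depending on the encoding; same bytes, different characters depending on which encoding the reader assumes.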
3. This thing: \u00A0
- is a sequence of 6 bytes (ASCII bytes for backslash, u, zero, zero, A, zero);
- has special meaning in a programming language like Java or Python,
where it is essentially a macro for the no-break space character;
- is used when representing the character directly as encoded bytes is
impractical or impossible.
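To see the difference between the six characters you type and the one character you get, here is a small Python sketch (a raw string is used only to keep the six characters from being interpreted):

```python
escaped = "\u00A0"  # six characters in the source, one character in the string
raw = r"\u00A0"     # raw string literal: all six characters are kept as typed

print(len(escaped))         # 1
print(len(raw))             # 6
print(escaped == chr(160))  # True

# The unicode_escape codec performs the same expansion at run time.
expanded = raw.encode("ascii").decode("unicode_escape")
print(expanded == escaped)  # True
```

Note that nothing here involves UTF-8: the escape names a character number, not a byte sequence.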
4. This thing: &#160;
or this thing: &#xA0;
- is to SGML applications like HTML and XML what \u00A0 is to Java & Python;
- is called a character reference (or "numeric character reference").
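As a rough illustration, Python's bundled XML parser (xml.etree.ElementTree, chosen here only because it is handy) resolves the decimal reference &#160; and the hex reference &#xA0; to the same single character:

```python
import xml.etree.ElementTree as ET

decimal_text = ET.fromstring("<p>a&#160;b</p>").text
hex_text = ET.fromstring("<p>a&#xA0;b</p>").text

expected = "a" + chr(160) + "b"
print(decimal_text == expected)  # True
print(hex_text == expected)      # True
```

The document's encoding never enters into it; the parser hands back character number 160 either way.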
5. This thing: &nbsp;
- is to SGML applications like HTML and XML an "entity reference";
- refers to an entity (a separate collection of information) named nbsp;
- depending on the circumstances, is intended to be treated by the
XML parser or HTML user agent as equivalent to the entity's
"replacement text";
- is, in HTML, predefined to have the replacement text of just one
character, the no-break space;
- is not defined by default in XML.
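The HTML versus XML difference for the entity reference &nbsp; can be sketched in Python. html.unescape and xml.etree.ElementTree are standard library; the sample documents and variable names are made up for the example:

```python
import html
import xml.etree.ElementTree as ET

# HTML predefines nbsp, so an HTML-aware function knows what it means.
print(html.unescape("&nbsp;") == chr(160))  # True

# XML does not predefine it, so a bare reference is an error.
try:
    ET.fromstring("<p>a&nbsp;b</p>")
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)  # True

# Declare the entity in the internal DTD subset and the parser accepts it.
declared = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>a&nbsp;b</p>'
declared_text = ET.fromstring(declared).text
print(declared_text == "a" + chr(160) + "b")  # True
```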
6. The thing here in between the quotes: " "
- is byte 0xA0;
- is intended to be a no-break space because this email is iso-8859-1
encoded;
- has exactly the same meaning in an XML document as &#160;.
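A last sketch, again in Python with made-up sample documents, showing that a literal byte 0xA0 in an iso-8859-1 encoded XML document and the reference &#160; come out of the parser as the same character:

```python
import xml.etree.ElementTree as ET

raw_byte = b"\xa0"
print(raw_byte.decode("iso-8859-1") == chr(160))  # True

prolog = b'<?xml version="1.0" encoding="iso-8859-1"?>'
literal_text = ET.fromstring(prolog + b"<p>a\xa0b</p>").text
reference_text = ET.fromstring(prolog + b"<p>a&#160;b</p>").text
print(literal_text == reference_text)  # True

# On its own, byte 0xA0 is not a valid UTF-8 sequence.
try:
    raw_byte.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False
```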
- Mike
____________________________________________________________________________
mike j. brown | xml/xslt: http://skew.org/xml/
denver/boulder, colorado, usa | resume: http://skew.org/~mike/resume/
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list