xsl-list

nbsp is not that hard, folks

2002-11-08 00:12:57
Brian Grainger wrote:
If you're trying to escape &nbsp; in a document encoded as
UTF-8, you have to use Unicode escaping of the UTF-8
representation of the entity. In this case, &nbsp; is equal to
&#160;, and &#160; encoded as UTF-8 is \u00A0.

Good grief. No, you have your terminology badly mixed up, and you're throwing
in an irrelevant notation. "&nbsp;", "&#160;" and "\u00A0" have nothing,
NOTHING to do with UTF-8. There is something about nbsp that just confuses the
heck out of people. I think it must be the fact that it looks like a space,
and that you don't have an nbsp key on your keyboard.

OK, read this.

1. There is a character -- an abstract unit in a "script" (a writing system;  
we are using Latin right now) -- called NO-BREAK SPACE by the Unicode Standard
and ISO/IEC 10646. Unicode and ISO/IEC 10646 assign this character an integer
number, 160, which is A0 in hex. We say Unicode all the time around here, but 
we mean ISO/IEC 10646 because that's what the XML and HTML specs reference. 
The two standards share the same character repertoire and numbering so there's 
no harm.
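
To see the character and its number apart from any encoding, here is a small
Python sketch. It assumes nothing beyond the standard library; the variable
name nbsp is just for illustration.

```python
import unicodedata

# Build the character from its code point, with no bytes involved yet.
nbsp = chr(160)

print(unicodedata.name(nbsp))      # the character's official name
print(ord(nbsp), hex(ord(nbsp)))   # its number in decimal and in hex
```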

2. UTF-8 is an encoding scheme that provides a way of representing any of the
approximately 1.1 million possible abstract characters in Unicode as a
sequence of 1 to 4 bytes. The UTF-8 representation of the Unicode character
160 (no-break space) is the pair of bytes C2 A0, in that order. In contrast,
iso-8859-1 is a character map that provides a way of representing the first
256 Unicode characters as a single byte. us-ascii is an even more limited set 
of just the first 128, mapped to a single byte.
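
If you want to check those byte values yourself, something like the following
Python sketch should do it. The codec names are the ones Python's standard
library accepts; the variable names are made up for the example.

```python
nbsp = chr(160)

# Same abstract character, two different byte representations.
utf8_bytes = nbsp.encode("utf-8")
latin1_bytes = nbsp.encode("iso-8859-1")
print(utf8_bytes.hex())     # two bytes
print(latin1_bytes.hex())   # one byte

# us-ascii only covers the first 128 characters, so 160 cannot be encoded.
try:
    nbsp.encode("us-ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```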

3. This thing:  \u00A0
  - is a sequence of 6 bytes (ASCII bytes for backslash, u, zero, zero, A, zero);
  - has special meaning in a programming language like Java or Python,
     where it is essentially a macro for the no-break space character;
  - is used when representing the character directly as encoded bytes is
     impractical or impossible.
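
A rough illustration in Python, assuming a Python 3 interpreter: the raw
string keeps the six characters exactly as typed, while the ordinary string
literal lets the language resolve the escape to one character. The names are
hypothetical.

```python
escaped_source = r"\u00A0"   # raw string: the six characters as typed
char = "\u00A0"              # normal literal: the escape becomes one character

print(len(escaped_source), len(char))
print(escaped_source.encode("ascii"))   # the six ASCII bytes

# Asking Python to interpret the escape by hand gives the same character.
decoded = escaped_source.encode("ascii").decode("unicode_escape")
print(decoded == char)
```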

4. This thing: &#160;
or this thing: &#xA0;
  - is to SGML applications like HTML and XML what \u00A0 is to Java & Python;
  - is called a character reference (or "numeric character reference").
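
As a sanity check, the sketch below feeds the decimal and hex character
references to Python's standard XML parser and HTML unescaper. The element
name p is arbitrary, chosen only for the example.

```python
import html
import xml.etree.ElementTree as ET

# The decimal and hexadecimal forms are two spellings of the same reference.
decimal_ref = ET.fromstring("<p>&#160;</p>").text
hex_ref = ET.fromstring("<p>&#xA0;</p>").text

# html.unescape resolves character references the way an HTML parser would.
html_ref = html.unescape("&#160;")

print(decimal_ref == hex_ref == html_ref == chr(160))
```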

5. This thing: &nbsp;
  - is to SGML applications like HTML and XML an "entity reference";
  - refers to an entity (a separate collection of information) named nbsp;
  - depending on the circumstances, is intended to be treated by the 
     XML parser or HTML user agent as equivalent to the entity's
     "replacement text";
  - is, in HTML, predefined to have the replacement text of just one 
     character, the no-break space;
  - is not defined by default in XML.
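
The HTML/XML difference can be seen with Python's standard library, roughly as
sketched here. The DOCTYPE with an internal subset is one possible way to
declare the entity yourself, not the only one; the element and variable names
are invented for the example.

```python
import html
import xml.etree.ElementTree as ET

# HTML predefines nbsp, so an HTML-style unescape just works.
html_text = html.unescape("&nbsp;")

# Plain XML does not define nbsp, so the parser rejects the reference.
try:
    ET.fromstring("<p>&nbsp;</p>")
    undefined_ok = True
except ET.ParseError:
    undefined_ok = False

# Declaring the entity in the document's internal DTD subset makes it legal.
declared_doc = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>&nbsp;</p>'
declared_text = ET.fromstring(declared_doc).text

print(html_text == declared_text == chr(160), undefined_ok)
```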

6. The thing here in between the quotes:   " "
  - is byte 0xA0;
  - is intended to be a no-break space because this email is iso-8859-1 
     encoded;
  - has exactly the same meaning in an XML document as &#160;.
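
To convince yourself of that last point, compare the two forms after parsing.
This is only a sketch: both sample documents declare iso-8859-1 in the XML
declaration, and the element name p is arbitrary.

```python
import xml.etree.ElementTree as ET

# One document carries the raw byte 0xA0, the other a character reference.
raw_doc = b'<?xml version="1.0" encoding="iso-8859-1"?><p>\xa0</p>'
ref_doc = b'<?xml version="1.0" encoding="iso-8859-1"?><p>&#160;</p>'

raw_text = ET.fromstring(raw_doc).text
ref_text = ET.fromstring(ref_doc).text

# After parsing, the application sees the same single character either way.
print(raw_text == ref_text == chr(160))
```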

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list