
Re: Problems with entities

2003-11-19 14:12:23
cknell(_at_)onebox(_dot_)com wrote:

If you are using the UTF-8 encoding, for example, the ó character is represented by
ó

Actually, the encoding doesn't matter--what matters is the character set, which is always Unicode for XML.

That is, the character ó (Latin small letter o with acute) is that character in the Unicode character set regardless of the encoding.
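As a rough illustration of this point, the sketch below parses two tiny XML documents that declare different encodings but contain the same numeric character reference. The element name `a`, the helper `text_of`, and the choice of `&#243;` as the reference are assumptions made for the example only; Python's standard-library parser stands in for "an XML processor".

```python
import xml.etree.ElementTree as ET

def text_of(doc: bytes) -> str:
    """Parse a small XML document and return the root element's text."""
    return ET.fromstring(doc).text

# Same character reference, two different declared encodings.
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?><a>&#243;</a>'
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>&#243;</a>'

utf8_text = text_of(utf8_doc)
latin1_text = text_of(latin1_doc)

# Both resolve to U+00F3, because the reference names a Unicode code point,
# not a byte value in the document's encoding.
print(utf8_text == latin1_text == "\u00f3")
```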

Characters are an abstraction. A character set is nothing more than an arbitrary mapping of abstract characters to unique numbers by which those characters can be referenced. In Unicode, each character also has a unique name that can be used instead of the character code to refer to the character (although not all processors know how to resolve these names).
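The mapping between character, number, and name can be inspected with Python's standard `unicodedata` module. This is only a sketch of the idea using one processor that does know how to resolve names; the variable names are illustrative.

```python
import unicodedata

ch = "\u00f3"                      # the character, written by code point
code = ord(ch)                     # its number in the Unicode character set
name = unicodedata.name(ch)        # its unique Unicode name
same = unicodedata.lookup(name)    # resolve the name back to the character
by_name = "\N{LATIN SMALL LETTER O WITH ACUTE}"  # name used in a literal

print(code, hex(code), name)
print(same == ch == by_name)
```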

The encoding simply determines how the characters are written to disk as sequences of bytes. For example, in the UTF-8 encoding this character is written as the two bytes 0xC3 0xB3 (because its code, 243, is greater than 127, the point above which UTF-8 uses two or more bytes per character), while in the UTF-16 encoding it is written as the two bytes 0x00 0xF3 (in big-endian order), because UTF-16 uses two bytes for each of the roughly 65K characters of the Unicode Basic Multilingual Plane. In a single-byte encoding such as ISO-8859-1 it is the one byte 0xF3. In all cases the character (the abstraction) is the same: lowercase o with acute.
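The byte sequences are easy to check directly. A minimal sketch in Python, where `str.encode` plays the role of "writing to disk" (the ISO-8859-1 line is included only for comparison):

```python
ch = "\u00f3"  # Latin small letter o with acute

utf8_bytes = ch.encode("utf-8")         # multi-byte: code point is above 127
utf16_bytes = ch.encode("utf-16-be")    # one 16-bit unit, big-endian, no BOM
latin1_bytes = ch.encode("iso-8859-1")  # single-byte encoding

for label, data in [("UTF-8", utf8_bytes),
                    ("UTF-16BE", utf16_bytes),
                    ("ISO-8859-1", latin1_bytes)]:
    print(f"{label:<11}{len(data)} byte(s): {data.hex(' ')}")
```

Whatever the bytes, decoding each sequence with its own encoding gives back the same abstract character.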

To read an XML file, the XML processor must first read the sequence of bytes on disk and then interpret that byte sequence as a sequence of characters. Therefore, it must know the encoding because the same sequence of bytes may result in different characters (or be invalid) depending on the encoding it is interpreted as.
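To make this concrete, here is a small sketch that interprets the same bytes under different encodings. It uses plain Python decoding rather than a full XML parser, and the variable names are illustrative.

```python
raw = b"\xc3\xb3"  # two bytes as they might sit on disk

as_utf8 = raw.decode("utf-8")         # one character: U+00F3
as_latin1 = raw.decode("iso-8859-1")  # two characters: U+00C3, U+00B3

# A lone 0xF3 byte is a complete character in ISO-8859-1 but an
# incomplete (invalid) sequence in UTF-8.
try:
    b"\xf3".decode("utf-8")
    lone_byte_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_valid_utf8 = False

print(len(as_utf8), len(as_latin1), lone_byte_valid_utf8)
```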

Cheers,

Eliot
--
W. Eliot Kimber
Innodata Isogen
eliot(_at_)isogen(_dot_)com
www.isogen.com


XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


