xsl-list

RE: doctype declaration and msxmldom

2003-06-20 05:55:28
Okay. I may have been off on the encoding thing.  It was just a
suggestion of another path to look at.
But DOCTYPE requires a DTD.

-----Original Message-----
From: owner-xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
[mailto:owner-xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com] On Behalf Of 
Mike Brown
Sent: Thursday, June 19, 2003 8:51 PM
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] doctype declaration and msxmldom


The post to which I'm replying had nothing directly to do with XSLT, but
I feel compelled to respond, because the information in it is rife with
errors, and because I'm obsessed with character encoding.

Nancy Pate wrote:
> I work with SGML.  When you declare "DOCTYPE" the composing/processing
> engine is going to expect a DTD.

Do not try to divine the XML parsing model or the XSLT processing model
just based on the default, apparent behavior of your favorite toolsets
and their usually less-than-thorough documentation.

I don't know what the requirements are for SGML parsers, but XML parsers
have much leeway as to when they are required to read a DTD, and what
parts of the DTD they must read (for example, external parts are
optional).

More importantly, the XML parser's user has control over whether the
parser tries to validate or not. And the parser (say, Expat), can be set
to do things like read external entities but not external DTDs, allowing
situations where you can still parse a document that contains an entity
reference without a corresponding entity declaration, so long as the
standalone declaration agrees.

Furthermore, XML document authors have flexibility in what they can do;
for example, <!DOCTYPE blah> is legal even though it does not contain any
DTD info at all.
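
As a minimal sketch of that last point, assuming Python's standard
xml.etree.ElementTree module (which is built on Expat and does not
validate), a document whose DOCTYPE names only the root element parses
without any DTD being available. The element content here is made up
for illustration.

```python
import xml.etree.ElementTree as ET

# A document type declaration that names the root element but has no
# internal subset and no external identifier: there is no DTD to read.
doc = "<!DOCTYPE blah><blah>hello</blah>"

root = ET.fromstring(doc)
print(root.tag, root.text)
```

Other parsers and other settings may behave differently; this only shows
that the declaration by itself does not force a DTD lookup.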

> Can you declare the necessary encoding
> in the XML declaration (<?xml version="1.0" encoding="ISO-8858-1"?>)

ISO-8859-1. And an encoding declaration is an informative hint to the
XML parser to tell it how the *bytes* of the document (think of what you
see if you look at the document in a hex editor rather than a text
editor) should be converted to Unicode characters as it is read in.

There is only one correct encoding that you can declare: the one
actually used for producing the bytes that comprise that particular
document. It has to be accurate, or "close enough" in the case of, say,
a US-ASCII encoded document being declared as UTF-8. You cannot just
make it up.
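
A rough illustration, again assuming Python's xml.etree.ElementTree and
hand-built byte strings (the element name "a" is arbitrary): the same
character arrives as different bytes depending on the encoding, and the
declaration has to match those bytes.

```python
import xml.etree.ElementTree as ET

E_GRAVE = "\u00e8"  # small Latin letter e with grave accent

# One byte (E8) in ISO-8859-1, two bytes (C3 A8) in UTF-8.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xe8</a>'
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?><a>\xc3\xa8</a>'

# Pure US-ASCII bytes declared as UTF-8: "close enough", since ASCII
# bytes mean the same thing in both encodings.
ascii_as_utf8 = b'<?xml version="1.0" encoding="UTF-8"?><a>plain ASCII</a>'

# ISO-8859-1 bytes wrongly declared as UTF-8: a lone E8 byte is not a
# valid UTF-8 sequence, so the parser is expected to reject it.
mislabeled = b'<?xml version="1.0" encoding="UTF-8"?><a>\xe8</a>'

latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text
ascii_text = ET.fromstring(ascii_as_utf8).text

try:
    ET.fromstring(mislabeled)
    mislabeled_ok = True
except ET.ParseError:
    mislabeled_ok = False

print(latin1_text == utf8_text == E_GRAVE, ascii_text, mislabeled_ok)
```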

> and then use the Unicode number?

"Using the Unicode number," in more correct terminology, is "using a
(numeric) character reference" like "&#232;" or "&#xE8;".

By definition, a character reference always uses Unicode code points. So
"&#232;" and "&#xE8;" both refer to Unicode character number 232
(decimal), which happens to be the small Latin letter e with grave
accent.

When using a character reference, the encoding of the document is
irrelevant. &#232; always means the Unicode character at code point 232,
never "byte 232 in encoding XYZ", unless you are using that nonconformant
abomination known as Netscape Navigator (or Communicator) version 4.
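
To make that concrete, here is a small sketch, assuming Python's
xml.etree.ElementTree and three made-up one-element documents: the
declared encoding changes from document to document, but the character
references resolve to the same code point every time.

```python
import xml.etree.ElementTree as ET

# The same character reference (decimal or hex form) inside documents
# that declare three different encodings.
docs = [
    b'<?xml version="1.0" encoding="US-ASCII"?><a>&#232;</a>',
    b'<?xml version="1.0" encoding="ISO-8859-1"?><a>&#xE8;</a>',
    b'<?xml version="1.0" encoding="UTF-8"?><a>&#232;</a>',
]

texts = [ET.fromstring(d).text for d in docs]
code_points = [ord(t) for t in texts]
print(code_points)
```

Note that every byte in these documents is plain ASCII; the reference is
just markup, which is why it survives any ASCII-compatible encoding.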

> I have a table that says that &egrave; has
> a UTC code of #x00E8

To hopefully clear up your confusion with more correct terminology...

The predefined HTML entity named "egrave" has as its replacement text
the actual character number E8 (hexadecimal) of the Universal Character
Set (UCS): small Latin letter e with grave accent. 

You can more or less think of entities as text macros, although every
document or binary 'file' is on some level an entity, so it's not a
perfect analogy. Please try to distinguish between a named "entity
reference" and a numeric "character reference" though. Then you can get
creative and say "character entity reference" when you mean things like
"&egrave;" so long as the egrave entity's replacement text is a single
character.
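
The distinction can be sketched in Python, assuming the standard html,
html.entities and xml.etree.ElementTree modules. HTML predefines egrave,
whereas XML predefines only lt, gt, amp, apos and quot, so the XML
document below declares the entity itself in an internal subset (the
element name "p" is arbitrary).

```python
import html
from html.entities import name2codepoint
import xml.etree.ElementTree as ET

# HTML's predefined entity table maps the name to a code point.
egrave_cp = name2codepoint["egrave"]
html_text = html.unescape("&egrave;")

# In XML the entity must be declared before "&egrave;" (an entity
# reference) can be used; "&#232;" (a character reference) needs no
# declaration at all.
xml_doc = '<!DOCTYPE p [<!ENTITY egrave "&#232;">]><p>&egrave; &#232;</p>'
xml_text = ET.fromstring(xml_doc).text

print(hex(egrave_cp), html_text, xml_text)
```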

The UCS is the normative basis of SGML, HTML, and XML, and is defined by
ISO/IEC 10646, the international standard that assigns numbers to the
idea of nearly every character used in nearly every written language
script on the planet. This standard is often informally referred to as
Unicode because it is developed in tandem with and shares its character
assignments with The Unicode Standard, a more thorough but perhaps less
political publication that does not fall under the ISO's jurisdiction.

UTC (what you said) is Coordinated Universal Time, which is pretty much
the Greenwich/Zulu time zone...

-Mike

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


