RE: Identifying output from the (MS) xml parser

Hi Hugh,

> I believe the output from the parser must be one of the following:
> A result tree;
> A wide (Unicode) string;
> An ASCII (8bit) string;


The choices here are slightly contradictory.  The transformation will
output XML (which I assume is what you are calling a result tree), HTML
or text, according to the @method attribute of <xsl:output> (see also
below).  But that is not really related to the encoding, i.e. whether
the output is UTF-16 or not.
The transformation engine (parser) will actually output either a UTF-16
string (BSTR) or a stream - where that stream might be encoded as ASCII
or any of a multitude of other encodings.  But it will never output an
8-bit ASCII string as such - output in that particular encoding would
have to go into an output stream.

> I believe which of these is produced will be determined by the
> <xsl:output> element.


Yes, the output is determined by the <xsl:output> element - if it is
present (neither that element nor any of its deciding attributes is
obligatory in the stylesheet)...

The method is determined by the @method attribute (if present) of the
<xsl:output> element.  If the @method attribute is omitted then the
transformation engine will use defaults - which means the output will
either be XML or HTML (see http://www.w3.org/TR/xslt#output).  The
default is to output XML unless the first output element is named <HTML>
(in any case combination) in which case it assumes the output is HTML.
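
As a rough sketch of that defaulting rule (not MSXML's actual code): the
function name and parameters below are made up for illustration, an
empty string stands for "attribute not present", and the namespace
check comes from the XSLT 1.0 spec rather than from anything
MSXML-specific.  The spec's extra condition about whitespace-only text
before the first element is left out.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lower-case an ASCII string so "HTML", "Html" and "html" compare equal.
std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Hypothetical helper: work out the effective output method.
//   declared_method     - value of <xsl:output method="...">, "" if absent
//   first_element_name  - local name of the first element in the output
//   first_element_ns    - its namespace URI, "" if it has none
std::string effective_output_method(const std::string& declared_method,
                                    const std::string& first_element_name,
                                    const std::string& first_element_ns) {
    // An explicit @method always wins.
    if (!declared_method.empty()) {
        return declared_method;
    }
    // Default to HTML only for an un-namespaced <html> element, any case.
    if (first_element_ns.empty() && to_lower_ascii(first_element_name) == "html") {
        return "html";
    }
    // Otherwise the default is XML.
    return "xml";
}
```

So effective_output_method("", "HTML", "") gives "html", while the same
call with an element named "catalog" gives "xml".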

The encoding is determined by the @encoding attribute (if present) of
the <xsl:output> element.  If not specified then the default is always
UTF-16.
But there is a big gotcha with this (and the cause of the biggest FAQ
with MSXML): even if the @encoding attribute is specified, you may still
end up with UTF-16 output, depending on which method you used to perform
the transformation:
1) if you use .transformNode() method then the output will always be
UTF-16 because the result of that method is a BSTR - so it must, by
nature, be encoded as UTF-16.
2) if you use the .transformNodeToObject() method then the output will
be UTF-16 if the second parameter of that method call is a DOM object.
But if the second parameter is a stream object (i.e. one that supports a
.write() method) then the output will be encoded according to the
encoding specified by the @encoding attribute.
3) if you use the IXSLProcessor/IXSLTemplate interfaces to perform the
transformation then it depends on 'how' you use these interfaces to
determine whether you will get UTF-16 or some other encoding specified
by the @encoding attribute.  This is because the .output property of the
IXSLProcessor interface can be set prior to transformation or just read
after transformation.  If the .output property is assigned prior to
transformation with a stream object then the stream will be written to
in the specified encoding.  But if the .output property is only read
after the transformation then the output will always be UTF-16 because
the .output property, in this case, can only contain a BSTR.
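
The three cases above can be summarised as a small decision function.
This is only a model of the behaviour described in this message, not a
call into MSXML: the enum, its member names and the function name are
invented for the illustration, and an empty string stands for "no
@encoding attribute".

```cpp
#include <string>

// Hypothetical labels for the ways of running the transformation.
enum class Route {
    TransformNode,                 // .transformNode(): result is a BSTR
    TransformNodeToObjectDom,      // .transformNodeToObject() into a DOM object
    TransformNodeToObjectStream,   // .transformNodeToObject() into a stream object
    ProcessorOutputReadAfter,      // IXSLProcessor: .output only read afterwards
    ProcessorOutputStreamAssigned  // IXSLProcessor: .output set to a stream first
};

// Encoding you should expect to get, given the route taken and the
// value of @encoding on <xsl:output> ("" when it is not specified).
std::string resulting_encoding(Route route, const std::string& declared_encoding) {
    switch (route) {
        case Route::TransformNodeToObjectStream:
        case Route::ProcessorOutputStreamAssigned:
            // Only stream targets honour @encoding; the default is UTF-16.
            return declared_encoding.empty() ? "UTF-16" : declared_encoding;
        case Route::TransformNode:
        case Route::TransformNodeToObjectDom:
        case Route::ProcessorOutputReadAfter:
            // BSTR and DOM targets are UTF-16 whatever @encoding says.
            return "UTF-16";
    }
    return "UTF-16";  // not reached; keeps compilers quiet
}
```

For example, a stylesheet declaring encoding="ISO-8859-1" still gives
"UTF-16" for Route::TransformNode, but "ISO-8859-1" for
Route::ProcessorOutputStreamAssigned.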

If you don't know what the user is delivering to you to be transformed
then it is probably best to use the IXSLProcessor/IXSLTemplate
interfaces to perform the transformation - and set the .output property
to a stream object prior to the transformation.

> Do I have to interrogate the style sheet to find this information?


This is probably unwise as a first step.  It is something you could use
to clarify things, but only after a process of elimination and
detection, which you would then follow by checking what may have been
specified on the <xsl:output> element.

> If so, can I assume that the <xsl:stylesheet> element is the second
> node in the xslt tree, or at least a top level element?


Not really - the <xsl:stylesheet>, if used, must be the root element but
the stylesheet might not contain an <xsl:stylesheet> element at all (see
http://www.w3.org/TR/xslt#result-element-stylesheet).

> If the output method is "xml" or "html" the output must be a result
> tree/xml document.


Not really - bear in mind that the whole point of the HTML output method
is to be able to generate HTML which may not constitute well-formed XML.

> I have also seen an attribute of <output> called media-type, but have
> not been able to find any documentation on this.  Can anyone comment
> on this?


The documentation on this is in the spec - but I don't think it has any
great impact on what you are doing.  With MSXML the only time you will
see any impact of this is when you use an output @method of HTML where
the media type is placed in the @content attribute of a <META> tag.

> If the <output> "method" is neither "xml" nor "html", then I assume
> the output is a character stream.


It is as well not to confuse in any way the output method and encoding -
as they are, for the most part, unrelated.

> Whether this uses 16 bit or 8 bit/multibyte characters will depend
> upon the "encoding" attribute.  Is there a concise list of the
> "encoding" values that result in characters of a particular size, or
> some other way to determine this information?


For MSXML there isn't even a concise list of encodings that are
supported - because this will vary from machine to machine - depending
on what language packs etc. are present on that machine.
But whether a particular encoding is 16-bit or 8-bit is probably going
to be a distraction rather than a help.  You won't want to write code
that copes with every encoding yourself when there are Windows APIs that
will convert for you (e.g. take a look at MultiByteToWideChar() and
WideCharToMultiByte()) - but in order to use these you will need to
ascertain the encoding of the output.

> My user supplies both xml and xslt input.
> The input may generate a new xml document, or a flat file.
> I am trying to determine what comes out of the ms parser.
> Could someone(s) please advise me of the accuracy, or otherwise, of
> the following statements?


This would all depend on what you then want to do with the output.  If
you are just going to save the output to, say, a file then the encoding
shouldn't matter - just save the output stream to the file as is.  You
may, of course, need to figure out the output type (XML, HTML or text)
in order to choose the best file extension for the output file.
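
Picking the extension could be as simple as the sketch below.  The
function name and the extensions themselves (including the ".out"
fallback for any other method) are just assumptions for the example;
use whatever suits your application.

```cpp
#include <string>

// Hypothetical helper: choose a file extension from the output method
// ("xml", "html" or "text") that was determined for the transformation.
std::string extension_for_method(const std::string& method) {
    if (method == "xml")  return ".xml";
    if (method == "html") return ".html";
    if (method == "text") return ".txt";
    return ".out";  // assumed fallback for any other (e.g. vendor) method
}
```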

It might also be worth looking into BOMs (Byte Order Marks), as
these, if present on the output, will give you a good indication of the
encoding that was used for the output (see also
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing and
http://www.unicode.org).
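
A minimal sketch of BOM sniffing in portable C++, assuming you have
already read the first few bytes of the output stream into a
std::string (the function name is invented for the example).  It only
recognises the standard Unicode BOMs; with no BOM you would fall back
to the XML declaration or to what the stylesheet specified.

```cpp
#include <cstddef>
#include <string>

// Guess the encoding of raw output bytes from a leading Byte Order Mark.
// Returns "" when no known BOM is present.
std::string encoding_from_bom(const std::string& bytes) {
    const std::size_t n = bytes.size();
    // Read byte i as an unsigned value so comparisons with 0xFF etc. work.
    auto b = [&bytes](std::size_t i) -> unsigned {
        return static_cast<unsigned char>(bytes[i]);
    };
    // UTF-32 must be tested first: its little-endian BOM (FF FE 00 00)
    // begins with the UTF-16 little-endian BOM (FF FE).  This means
    // UTF-16LE text starting with a NUL character would be misread.
    if (n >= 4 && b(0) == 0x00 && b(1) == 0x00 && b(2) == 0xFE && b(3) == 0xFF)
        return "UTF-32BE";
    if (n >= 4 && b(0) == 0xFF && b(1) == 0xFE && b(2) == 0x00 && b(3) == 0x00)
        return "UTF-32LE";
    if (n >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF)
        return "UTF-8";
    if (n >= 2 && b(0) == 0xFE && b(1) == 0xFF)
        return "UTF-16BE";
    if (n >= 2 && b(0) == 0xFF && b(1) == 0xFE)
        return "UTF-16LE";
    return "";
}
```

Once you know the encoding this way, you can pick the matching code
page for the conversion APIs mentioned earlier.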

Hope this helps
Marrow
http://www.marrowsoft.com - home of Xselerator (XSLT IDE and debugger)
http://www.topxml.com/Xselerator


-----Original Message-----
From: owner-xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
[mailto:owner-xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com] On Behalf Of 
Hugh Dixon
Sent: 10 December 2002 03:34
To: XSL-List(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: [xsl] Identifying output from the (MS) xml parser

I am writing some C++ code to run under windows, using the MSXML DOM
implementation.
My user supplies both xml and xslt input.
The input may generate a new xml document, or a flat file.
I am trying to determine what comes out of the ms parser.  
Could someone(s) please advise me of the accuracy, or otherwise, of the
following statements?

I believe the output from the parser must be one of the following:
A result tree;
A wide (Unicode) string;
An ASCII (8bit) string;

I believe which of these is produced will be determined by the
<xsl:output> element.
Do I have to interrogate the style sheet to find this information?

If so, can I assume that the <xsl:stylesheet> element is the second node
in the xslt tree, or at least a top level element?

I believe the <xsl:output> element can only be a direct child (topmost
element) of the <xsl:stylesheet> element.  Could someone confirm this?

If the output method is "xml" or "html" the output must be a result
tree/xml document.

I have also seen an attribute of <output> called media-type, but have
not been able to find any documentation on this.  Can anyone comment on
this?

If the <output> "method" is neither "xml" nor "html", then I assume the
output is a character stream.  Whether this uses 16 bit or 8
bit/multibyte characters will depend upon the "encoding" attribute.  Is
there a concise list of the "encoding" values that result in characters
of a particular size, or some other way to determine this information?

Thanks!!!

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list