Re: [xsl] Re: XML/XHTML fragment to text

2007-08-16 07:34:56
A couple of corrections to my previous statements about calculating the size of the output byte stream; see below:


Abel Braaksma wrote:

It also correctly gives &lt; as 4 characters when it is part of a text node or an attribute.

I meant: 4 bytes.
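
As a quick cross-check outside XSLT, the four bytes can be counted in Python. This is only a sketch: xml.sax.saxutils.escape stands in for the serializer's escaping (it is not what Saxon runs), and the variable names are made up for illustration.

```python
from xml.sax.saxutils import escape

# A text node containing a single "<" character.
text_node = "<"

# escape() replaces "<" with the entity reference "&lt;",
# which is what an XML serializer writes for it in text content.
escaped = escape(text_node)

# Encode to UTF-8 and count the bytes that would end up in the stream.
byte_count = len(escaped.encode("utf-8"))

print(escaped, byte_count)  # &lt; 4
```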

It *does not* correctly interpret cdata-section-elements on the xsl:output definition, but that's only a minor inconvenience (and an insignificant little bug in Saxon)

My mistake. I was trapped (again!) in the html-start-element-defaults-to-html-output-method atrocity of XSLT (not corrected in XSLT 2.0, probably for backward compatibility). This led to the cdata-section-elements attribute being ignored. Changing the method to "xml" or "xhtml" fixed the problem (as does changing the start element into something not <html>).

it does correctly interpret the omit-xml-declaration yes/no.

And it correctly interprets all other kinds of stuff of the xsl:output. I.e., I tested the following attributes that can alter the serialization:

* cdata-section-elements
* doctype-public
* doctype-system
* encoding
* exclude-result-prefixes
* include-content-type
* indent
* media-type
* method
* normalization-form
* omit-xml-declaration
* standalone
* use-character-maps
* disable-output-escaping


I did not test the following though:

* escape-uri-attributes
* extension-element-prefixes (does not influence the outcome when used on xsl:output)
* undeclare-prefixes (only for XML 1.1 anyway)
* use-when (does not influence output, but switches the instruction on/off)
* version
* xml:space
* xsl:version (might influence the way things are escaped)

One attribute on xsl:output always causes problems, as far as I could tell:

* byte-order-mark

When you use it together with UTF-8 it will offset the amount by one. This is because the byte order mark (xFEFF), when interpreted as a string, will be translated into the equivalent string representation in UTF-8, which is the byte sequence xEFBBBF, now representing the codepoint 65279 (U+FEFF) (Zero Width No Break Space, deprecated but allowed). This interpretation is in lieu of the Unicode recommendation. It is useless to put a BOM at the beginning of a UTF-8 stream, so it is best to avoid it.
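
The relation between the codepoint and its UTF-8 byte sequence can be checked with a few lines of Python. This is a sketch that only illustrates the encoding itself, not Saxon's behaviour; the variable names are assumptions.

```python
import codecs

# The byte order mark as a one-character string (U+FEFF).
bom = "\ufeff"

# Its UTF-8 encoding is the three-byte sequence EF BB BF.
utf8_bytes = bom.encode("utf-8")

print(ord(bom))                        # 65279
print(utf8_bytes.hex())                # efbbbf
print(len(bom), len(utf8_bytes))       # 1 3
print(utf8_bytes == codecs.BOM_UTF8)   # True
```

So the mark counts as one character in the string and as three bytes once encoded as UTF-8.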


You must be careful that the selected encodings match. If they don't, the string-to-hexBinary function will prove misleading (logically so).

This was incorrect. The serialized string will be radically different when, for instance, it is encoded in US-ASCII, but anything encoded in US-ASCII will always have the same representation in string-to-hexBinary if you use any of the non-IBM encodings, including UTF-8. In UTF-16 it will double, of course.
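
A small Python sketch of that claim, assuming a made-up sample string that contains only ASCII characters: its bytes are identical in US-ASCII and UTF-8, and twice as many in UTF-16 (big-endian is used here so that no BOM is prepended).

```python
# Hypothetical sample: a string that contains only ASCII characters.
ascii_only = "<p>resum&#233;'s</p>"

as_ascii = ascii_only.encode("ascii")
as_utf8 = ascii_only.encode("utf-8")
as_utf16 = ascii_only.encode("utf-16-be")  # big-endian, no BOM prepended

# ASCII characters map to the same single bytes in UTF-8.
print(as_ascii == as_utf8)            # True

# UTF-16 uses two bytes for each of these characters.
print(len(as_utf8), len(as_utf16))    # 20 40
```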

Consider the following (extreme) example:
<xsl:output name="output-def" method="xml" encoding="US-ASCII" cdata-section-elements="p" />

<xsl:template name="main">
  <xsl:variable name="result-tree"><p>resumé's</p></xsl:variable>
  <xsl:variable name="serialized" select="saxon:serialize($result-tree, 'output-def')" />
  <xsl:variable name="hexBin" select="saxon:string-to-hexBinary($serialized, 'UTF-8')" />
  <xsl:variable name="length" select="string-length(xs:string($hexBin)) div 2" />

  ....
</xsl:template>

Normally, the output in $serialized would look like the following:

<p><![CDATA[resumé's]]></p>

But because US-ASCII cannot represent the é character, the serializer must move it out of the CDATA section and write it as a character reference, with the following as a result:

<p><![CDATA[resum]]>&#233;<![CDATA['s]]></p>

Obviously, the lengths are quite different. The string length of the first is 27 and of the second 44. Encoded as UTF-8, the first takes 28 bytes (because é needs two bytes) and the second 44 (because US-ASCII is 100% compatible with the 1-byte sequences of UTF-8).
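
These counts can be reproduced outside Saxon. The Python sketch below mirrors the "hex string length div 2" calculation from the template; hex_byte_length is a hypothetical helper written for this illustration, not a Saxon function, and the two strings are copied from the serializer output shown above.

```python
# The two serializations shown above; \u00e9 is the é character.
utf8_output = "<p><![CDATA[resum\u00e9's]]></p>"
ascii_output = "<p><![CDATA[resum]]>&#233;<![CDATA['s]]></p>"


def hex_byte_length(s: str, encoding: str = "utf-8") -> int:
    """Hex-encode the bytes of s, then halve the length of the hex string."""
    hex_string = s.encode(encoding).hex()
    return len(hex_string) // 2


for label, s in (("UTF-8 output", utf8_output),
                 ("US-ASCII output", ascii_output)):
    # Print the character count followed by the UTF-8 byte count.
    print(label, len(s), hex_byte_length(s))
# UTF-8 output 27 28
# US-ASCII output 44 44
```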

Needless to say, apart from this extreme case, it is very hard to predict all the other ways in which the serializer may produce a conformant byte stream. I must admit that I found this approach very refreshing: with this Saxon-specific extension, it comes pretty close to finding the exact byte length of the document (or segment) *after* serialization (including white space, indentation, escaping etc).

Thanks for the exercise ;)

Cheers,
-- Abel Braaksma
