Re: [xsl] Re: XML/XHTML fragment to text

A couple of corrections to my previous statements on calculating the size of the output byte stream, see below:



Abel Braaksma wrote:

It also correctly gives < (serialized as &lt;) as 4 characters when it is part of a text node or an attribute.


I meant: 4 bytes.
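
As a quick sanity check of the "4 bytes" figure, here is a minimal Python sketch (not part of the Saxon approach; it uses the standard library's xml.sax.saxutils.escape as a stand-in for what a serializer does to a lone < in a text node):

```python
from xml.sax.saxutils import escape

# Escape a lone "<" the way an XML serializer would inside a text node.
escaped = escape("<")

# The escaped form is pure ASCII, so each character is one byte in UTF-8.
byte_count = len(escaped.encode("utf-8"))

print(escaped, byte_count)
```

Because the entity reference consists only of ASCII characters, the character count and the UTF-8 byte count coincide here.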

It *does not* correctly interpret cdata-section-elements on the xsl:output definition, but that's only a minor inconvenience (and an insignificant little bug in Saxon).

My mistake. I was trapped (again!) in the html-start-element-defaults-to-html-output-method atrocity of XSLT (not corrected in XSLT 2.0, probably for backward compatibility). This led to the cdata-section-elements attribute being ignored. Changing the method to "xml" or "xhtml" fixed the problem (as does changing the start element into something other than <html>).

It does correctly interpret the omit-xml-declaration yes/no.

And it correctly interprets all the other settings of xsl:output. That is, I tested the following attributes that can alter the serialization:


* cdata-section-elements
* doctype-public
* doctype-system
* encoding
* exclude-result-prefixes
* include-content-type
* indent
* media-type
* method
* normalization-form
* omit-xml-declaration
* standalone
* use-character-maps
* disable-output-escaping


I did not test the following though:

* escape-uri-attributes
* extension-element-prefixes (does not influence the outcome when used on xsl:output)
* undeclare-prefixes (only for XML 1.1 anyway)
* use-when (does not influence output, but switches the instruction on/off)
* version
* xml:space
* xsl:version (might influence the way things are escaped)

One attribute on xsl:output always causes problems, as far as I could tell, which is the following:


* byte-order-mark

When you use it together with UTF-8 it will offset the amount by one. This is because the byte order mark (xFEFF), when interpreted as a string, will be translated into the equivalent string representation in UTF-8, which is the byte sequence xEFBBBF, now representing the codepoint 65279 (U+FEFF) (Zero Width No Break Space, deprecated but allowed). This interpretation is in lieu of the Unicode recommendation. It is useless to put a BOM at the beginning of a UTF-8 stream, so it is best to avoid it.
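
To make the BOM numbers concrete, a small Python sketch (an independent check, not Saxon's behaviour) shows that U+FEFF is one character but three bytes in UTF-8, so how far the count is thrown off depends on whether you count characters or bytes:

```python
import codecs

bom = "\ufeff"                      # the BOM as a single character
utf8_bytes = bom.encode("utf-8")    # its UTF-8 representation

print(len(bom))                     # number of characters
print(ord(bom))                     # code point as a decimal number
print(utf8_bytes.hex().upper())     # the UTF-8 byte sequence in hex
print(len(utf8_bytes))              # number of bytes

# The standard library ships the same byte sequence as a constant.
same_as_constant = utf8_bytes == codecs.BOM_UTF8
```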

You must be careful that the selected encodings match. If they don't, the string-to-hexBinary function will prove misleading (logically so).

This was incorrect. The string will be radically different when, for instance, it is encoded in US-ASCII, and anything encoded in US-ASCII will always have the same representation in string-to-hexBinary if you use any of the non-IBM encodings, including UTF-8. In UTF-16 it will double, of course.
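
The compatibility claim is easy to check outside XSLT. The following Python sketch uses a made-up, pure US-ASCII sample string (an assumption for illustration only) and compares its byte representation in three encodings:

```python
# Hypothetical sample: any string that contains only US-ASCII characters.
text = "<p>resume</p>"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
# "utf-16-be" is used so that no byte order mark is prepended.
utf16_bytes = text.encode("utf-16-be")

print(ascii_bytes == utf8_bytes)              # identical byte sequences
print(len(utf8_bytes), len(utf16_bytes))      # UTF-16 uses two bytes per character here
```

The doubling only holds for characters in the Basic Multilingual Plane; with the generic "utf-16" codec Python would also add a two-byte BOM.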


Consider the following (extreme) example:

<xsl:output name="output-def" method="xml" encoding="US-ASCII" cdata-section-elements="p" />


<xsl:template name="main">
<xsl:variable name="result-tree"><p>resumé's</p></xsl:variable>

<xsl:variable name="serialized" select="saxon:serialize($result-tree, 'output-def')" />
<xsl:variable name="hexBin" select="saxon:string-to-hexBinary($serialized, 'UTF-8')" />
<xsl:variable name="length" select="string-length(xs:string($hexBin)) div 2" />


....
</xsl:template>
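
The length calculation in the template relies on the fact that every byte is written as two hex digits. A rough Python equivalent of that step (the function name byte_length_via_hex is made up for this sketch, and it does not reproduce Saxon's serializer) would be:

```python
def byte_length_via_hex(serialized: str, encoding: str = "utf-8") -> int:
    """Return the byte length of a string by counting hex digits and halving."""
    # Encode the string, then render every byte as two hexadecimal digits.
    hex_digits = serialized.encode(encoding).hex().upper()
    # Two hex digits per byte, so half the digit count is the byte count.
    return len(hex_digits) // 2


print(byte_length_via_hex("abc"))
print(byte_length_via_hex("\u00e9"))  # e with acute accent, two bytes in UTF-8
```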

Normally, the output in $serialized would look like the following:

<p><![CDATA[resumé's]]></p>

But, because of the low encoding chosen, the serializer must remove the é character from the CDATA section, with the following as a result:


<p><![CDATA[resum]]>&#233;<![CDATA['s]]></p>

Obviously, the lengths are quite different. The string size of the first is 27 and of the second is 44. The UTF-8 byte sequence of the first is 28 bytes (because é takes two bytes in UTF-8) and of the second 44 bytes (because US-ASCII is 100% compatible with the 1-byte sequences of UTF-8).
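
The figures for the two serializations can be recounted with a few lines of Python (a plain recount of the two strings shown above, written with an escape for é so the snippet itself stays ASCII). By this count the first string has 27 characters and 28 UTF-8 bytes, the second 44 of each:

```python
# The two serialized forms from the example; \u00e9 is the e with acute accent.
with_cdata = "<p><![CDATA[resum\u00e9's]]></p>"
split_cdata = "<p><![CDATA[resum]]>&#233;<![CDATA['s]]></p>"

for label, text in (("single CDATA", with_cdata), ("split CDATA", split_cdata)):
    chars = len(text)                   # string length in characters
    octets = len(text.encode("utf-8"))  # length of the UTF-8 byte sequence
    print(label, chars, octets)

# The split form contains only US-ASCII characters, so this does not raise.
ascii_octets = len(split_cdata.encode("ascii"))
```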

Needless to say, apart from this extreme case, it will be very hard to find out all the other ways the serializer may use to output a conformant byte stream. I must admit that I've found this approach very refreshing, and with this Saxon-specific extension it comes pretty close to finding the exact byte length of the document (or segment) *after* serialization (including white space, indentation, escaping, etc.).


Thanks for the exercise ;)

Cheers,
-- Abel Braaksma
