Interesting problem.
The thing that seems to make a dramatic difference to Saxon's speed on this
transformation is to introduce a variable for the expression $parents/*,
thus:
<xsl:template name="subtree">
<xsl:param name="parents"/>
<xsl:variable name="children" select="$parents/*"/>
<xsl:for-each select="$children">
<xsl:variable name="current-group"
select="$children[name() = name(current())]"/>
<xsl:if test="count($current-group[1] | .) = 1">
The surprising thing in this stylesheet is that the number of recursive
calls is not actually very high - it's the same as the number of elements
output to the result tree, which doesn't increase greatly as the input file
grows. But the size of $parents is linear with the source file size, which
means that calculating $parents/* within the for-each loop is really bad
news. I imagine that your rearrangement of the code somehow enabled MSXML to
spot that it could move the subexpression out of the for-each loop; why it
should do so on the rearranged code and not on the original is a mystery.
Saxon is currently doing this kind of optimization at the XPath (and XQuery)
level but not at the XSLT level.
Michael Kay
# -----Original Message-----
# From: owner-xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
# [mailto:owner-xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com] On Behalf Of
Ian Young
# Sent: 18 March 2004 22:30
# To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
# Subject: [xsl] Efficient recursive grouping in XSLT 1.0?
#
# I have some fairly sizeable XML files (about 5MB - 15MB) that
# I want to extract the element name structure from. By that, I
# mean that I'll feed an arbitrary XML document in one end, and
# get out an XML document that contains one element for every
# distinct list of name() values in the ancestor-or-self axis,
# retains the position in document order of the first
# occurrence. I'd also like a count for each element output.
# For the moment, I'm only interested in the elements:
# attributes and text content are not important.
#
# For example, if I have this input document:
# <root>
# <a>
# <b/><c/><d/><b/>
# </a>
# <b>
# <a>
# <a/>
# </a>
# </b>
# <a>
# <a/>
# <b>
# <a/>
# <d/>
# <a/>
# <a/>
# </b>
# </a>
# </root>
#
# I would want output that looks like this:
# <root ct="1">
# <a ct="2">
# <b ct="3">
# <a ct="3"/>
# <d ct="1"/>
# </b>
# <c ct="1"/>
# <d ct="1"/>
# <a ct="1"/>
# </a>
# <b ct="1">
# <a ct="1">
# <a ct="1"/>
# </a>
# </b>
# </root>
#
# In XSLT 2.0 this can be readily achieved by using
# for-each-group recursively. With Saxon 7.9 this stylesheet
# will generate a result for one of my 7MB files (400k
# elements) in 2 or 3 seconds (mostly parsing the input
# document) to produce the output of about 1k elements.
#
# <xsl:stylesheet version="2.0"
# xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
# <xsl:output method="xml" version="1.0" encoding="UTF-8"
# indent="yes"/>
#
# <xsl:template match="/">
# <xsl:call-template name="subtree">
# <xsl:with-param name="parents" select="/"/>
# </xsl:call-template>
# </xsl:template>
#
# <xsl:template name="subtree">
# <xsl:param name="parents"/>
# <xsl:for-each-group select="$parents/*" group-by="name()">
# <xsl:copy>
# <xsl:attribute name="ct">
# <xsl:value-of select="count(current-group())"/>
# </xsl:attribute>
# <xsl:call-template name="subtree">
# <xsl:with-param name="parents" select="current-group()"/>
# </xsl:call-template>
# </xsl:copy>
# </xsl:for-each-group>
# </xsl:template>
# </xsl:stylesheet>
#
# This is great, except that the system I want this to run in
# is using libxml and libxslt (consequently, XSLT 1.0).
# Now, transforming this into an XSLT 1.0 stylesheet can be
# done by just replacing the for-each-group with a for-each and
# checking if the context node is the first in $parents/* with
# that name:
#
# <xsl:stylesheet version="1.0"
# xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
# <xsl:output method="xml" version="1.0" encoding="UTF-8"
# indent="yes"/>
#
# <xsl:template match="/">
# <xsl:call-template name="subtree">
# <xsl:with-param name="parents" select="/"/>
# </xsl:call-template>
# </xsl:template>
#
# <xsl:template name="subtree">
# <xsl:param name="parents"/>
# <xsl:for-each select="$parents/*">
# <xsl:variable name="current-group"
# select="$parents/*[name() = name(current())]"/>
# <xsl:if test="count($current-group[1] | .) = 1">
# <xsl:copy>
# <xsl:attribute name="ct">
# <xsl:value-of select="count($current-group)"/>
# </xsl:attribute>
# <xsl:call-template name="subtree">
# <xsl:with-param name="parents" select="$current-group"/>
# </xsl:call-template>
# </xsl:copy>
# </xsl:if>
# </xsl:for-each>
# </xsl:template>
# </xsl:stylesheet>
#
# But the performance of this on large files is hopeless. I
# tried a smaller 1.3MB, 75k elements input file that generates
# a 5kB, 125 element output. msxml took 2 minutes (memory usage
# displaying odd sawtooth pattern
# -- GC?), Saxon 7.9 over 5 minutes and libxslt 1.0.4 I gave up
# on after 10 minutes.
#
# [Aside: I wondered whether it might be better to expanded
# $current-group in the test and only created the variable
# within the if element. The rationale being that the variable
# might be created strictly whereas the combined xpath
# expression might only be evaluated up to the first element.
#
# ...
# <xsl:if test="count(($parents/*[name() =
# name(current())])[1] | .) = 1">
# <xsl:variable name="current-group"
# select="$parents/*[name() = name(current())]"/>
# ...
#
# For msxsl makes an enormous difference: it's now faster than
# Saxon 7 running the XSLT 2.0 stylesheet -- that's for the 7MB
# file! How does _that_ work?! It doesn't make any great
# difference to Saxon (or libxslt?).]
#
# Am I right that there's no way of using Muenchian grouping
# for this because there's no possible XPath 1.0 expression for
# the list of ancestors' node names?
#
# Are there any other approaches with XSLT 1 that might get a
# decent performance with libxslt?
#
# One thing I thought of is to do the transform in two steps:
# the first generates output isomorphic to the input elements,
# but annotates each with its ancestor name list (as a string);
# the second traverses that file using Muenchian grouping. This
# works -- albeit with rather variable performance
# -- in msxml. For the 1.3MB file step 2 takes 71 seconds in
# Saxon and 75 in libxslt (and 1 second in msxml). This confuses me!
#
# TODO: Saxon gets java.lang.OutOfMemoryError on the second
# stage reading the output of stage 1 for larger files. How do
# I increase Java's memory limit?
#
# <!-- step 1 -->
# <xsl:stylesheet version="1.0"
# xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
# <xsl:output method="xml" version="1.0" encoding="UTF-8"
# indent="yes"/>
#
# <xsl:template match="*">
# <xsl:param name="parent-path" select="''"/>
# <xsl:variable name="p" select="concat($parent-path,'/',name())"/>
# <xsl:copy>
# <xsl:attribute name="p">
# <xsl:value-of select="$p"/>
# </xsl:attribute>
# <xsl:apply-templates>
# <xsl:with-param name="parent-path" select="$p"/>
# </xsl:apply-templates>
# </xsl:copy>
# </xsl:template>
#
# <xsl:template match="text()"/>
# </xsl:stylesheet>
#
# <!-- step 2 -->
# <xsl:stylesheet version="1.0"
# xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
# <xsl:output method="xml" version="1.0" encoding="UTF-8"
# indent="yes"/>
#
# <xsl:key name="p" match="*" use="@p"/>
#
# <xsl:template match="*">
# <xsl:if test="generate-id() = generate-id(key('p', @p)[1])">
# <xsl:copy>
# <xsl:attribute name="ct">
# <xsl:value-of select="count(key('p', @p))"/>
# </xsl:attribute>
# <xsl:for-each select="key('p', @p)">
# <xsl:apply-templates/>
# </xsl:for-each>
# </xsl:copy>
# </xsl:if>
# </xsl:template>
#
# <xsl:template match="text()"/>
# </xsl:stylesheet>
#
# Any comments, pointers, code suggestions gratefully received!
#
# Cheers,
#
# Ian.
#
# XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
#
#
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list