Re: [xsl] Which is less expensive group by or select distinct-values

2016-07-15 16:14:21
group-by and distinct-values are both going to have fairly similar time and 
memory characteristics, but of course the details depend on the specific 
processor.

But there are some very odd things going on in this code.


<xsl:variable name="TermList">
<xsl:value-of select="distinct-values(.//term[not(@keyref)])" 
separator=", " />

xsl:variable with an xsl:value-of child always has a bad smell. Why are you 
constructing an XML tree fragment when all you want is a string? In 99% of 
cases it should be <xsl:variable name="x" select="y"/>.

More important, why are the distinct values being concatenated into a single 
comma-separated string, only to be tokenized again immediately afterwards?

</xsl:variable>
<data type="topicreport" name="WDTermList">
 <xsl:for-each select="tokenize(normalize-space($TermList), ', ')">
      <xsl:sort select="." />
      <xsl:value-of select="."/>
        <xsl:if test="position() != last()">, </xsl:if>
  </xsl:for-each>
</data>

And then turned back into a comma-separated string again, this time in sorted 
order.
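
A minimal sketch of the select form recommended above, for the case where the variable is still wanted: it keeps the original variable name but binds it to the sequence of distinct values rather than to a joined string, so nothing has to be tokenized afterwards. The match="/" template and the stylesheet wrapper are assumptions added only to make the fragment complete; the thread does not show the real context node.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: in the real stylesheet this sits inside some other template. -->
  <xsl:template match="/">
    <!-- Bind the sequence of atomic values directly; no tree fragment is built. -->
    <xsl:variable name="TermList"
        select="distinct-values(.//term[not(@keyref)])"/>
    <data type="topicreport" name="WDTermList">
      <!-- Iterate over the sequence itself instead of tokenizing a joined string. -->
      <xsl:for-each select="$TermList">
        <xsl:sort select="."/>
        <xsl:if test="position() ne 1">, </xsl:if>
        <xsl:value-of select="."/>
      </xsl:for-each>
    </data>
  </xsl:template>
</xsl:stylesheet>
```

Unlike the original, this does not pass the values through normalize-space(), so terms containing irregular whitespace would be output as they appear in the source.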

If this hadn't existed in the stylesheet already, I would have probably
done something like:

<xsl:for-each-group select=".//term[not(@keyref)]" group-by=".">
  <xsl:sort select="current-grouping-key()" />
  <xsl:value-of select="current-grouping-key()"/>
  <xsl:if test="position() != last()">, </xsl:if>
</xsl:for-each-group>

That's certainly a lot better, assuming the comma-separation of the sorted list 
is actually wanted. Personally, I would write:

<xsl:for-each select="distinct-values(.//term[not(@keyref)])">
  <xsl:sort select="."/>
  <xsl:if test="position() ne 1">, </xsl:if>
  <xsl:value-of select="."/>
</xsl:for-each>

Note that putting a comma before every item except the first, rather than after 
every item except the last, avoids calling last(). Evaluating last() requires 
the processor to know the length of the sequence before it outputs the first 
item, which can disrupt the processing pipeline and increase memory usage. 
Saxon will usually handle either form OK, but you don't want to be 
over-reliant on the optimizer recognizing such coding patterns.
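
If the aim is to avoid position tests altogether, one alternative sketch (not the form given above) lets xsl:value-of insert the separators, with xsl:perform-sort supplying the sorted sequence. Both are standard XSLT 2.0 instructions. The match="/" template and the stylesheet wrapper are again assumptions for completeness, and no claim is made here about how any particular processor optimizes it.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: the context node is whatever the real template matches. -->
  <xsl:template match="/">
    <data type="topicreport" name="WDTermList">
      <!-- value-of joins the items of its content sequence with the separator. -->
      <xsl:value-of separator=", ">
        <!-- perform-sort returns the distinct values in sorted order. -->
        <xsl:perform-sort select="distinct-values(.//term[not(@keyref)])">
          <xsl:sort select="."/>
        </xsl:perform-sort>
      </xsl:value-of>
    </data>
  </xsl:template>
</xsl:stylesheet>
```

Neither position() nor last() appears, so the first-versus-last question does not arise.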

With distinct-values, the memory needed is for the set of distinct values. With 
for-each-group, it's much more likely that the memory requirement will be one 
entry for each distinct value, where the entry holds both the value, and the 
list of nodes having that value, which you don't need in this case.
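
To illustrate what the per-group node list is for, here is a hypothetical variant in which the groups are actually used: it reports how many times each term occurs. The original stylesheet does not need this, which is why distinct-values is enough there. The "term (count)" output format, the match="/" template and the stylesheet wrapper are all assumptions.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <data type="topicreport" name="WDTermList">
      <xsl:for-each-group select=".//term[not(@keyref)]" group-by=".">
        <xsl:sort select="current-grouping-key()"/>
        <xsl:if test="position() ne 1">, </xsl:if>
        <xsl:value-of select="current-grouping-key()"/>
        <!-- current-group() is the list of nodes sharing this value;
             it is the extra data that distinct-values never has to hold. -->
        <xsl:text> (</xsl:text>
        <xsl:value-of select="count(current-group())"/>
        <xsl:text>)</xsl:text>
      </xsl:for-each-group>
    </data>
  </xsl:template>
</xsl:stylesheet>
```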

I don't think the above is my major time sink in this process but it is
one class of things that I'm reporting. I think the real processing time
issue is coming from a lot of string analysis/parsing that is occurring.

Indeed, the costs might not come from this part of the code at all.

Michael Kay
Saxonica