Re: [xsl] How to make this script faster

Hi Mathieu,

From just looking at your stylesheet, I noticed a couple of things, but I don't know whether the changes will make it faster. You didn't specify which processor you use. If you use AltovaXML, it can at times be extremely slow (exponential running time), so it is worthwhile to try your code with a more optimized processor such as Saxon.

From the code I notice that you use XSLT 2.0, which can usually be optimized more easily than XSLT 1.0, both in your code (tail recursion, and "as" attributes to specify result types) and in the processor, because the language allows for easier optimization of common tasks (such as regular expressions instead of recursive templates).

But you still seem to use a lot of XSLT 1.0 techniques where I would prefer the 2.0 version. Consider replacing your xsl:call-template calls (named templates) with xsl:function (even recursively). Consider using if (value) then ... else ... instead of xsl:if or xsl:choose. Consider using matching templates instead of xsl:when etc., which may perform faster.
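
As a rough sketch of the first two suggestions, the recursive named template could be turned into a pair of functions along the following lines. The my: namespace and the function names my:is-section-title and my:section-body are made up for illustration, error handling for a missing or duplicated section is left out, and whether this is actually faster depends on the processor.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:local"
    exclude-result-prefixes="xs my">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
  <xsl:param name="extract-section" select="'C.2.2'"/>

  <!-- Hypothetical helper: true if the element's text starts with a
       section number followed by a space (same regex as the original). -->
  <xsl:function name="my:is-section-title" as="xs:boolean">
    <xsl:param name="node" as="element()"/>
    <xsl:sequence select="matches(normalize-space($node),
        '^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ ')"/>
  </xsl:function>

  <!-- Hypothetical replacement for copy-section-paragraphs: returns the
       leading elements up to (not including) the next section title. -->
  <xsl:function name="my:section-body" as="element()*">
    <xsl:param name="candidates" as="element()*"/>
    <xsl:sequence select="
        if (empty($candidates)) then ()
        else if (not($candidates[1][self::para or self::informaltable])
                 or my:is-section-title($candidates[1])) then ()
        else ($candidates[1],
              my:section-body(subsequence($candidates, 2)))"/>
  </xsl:function>

  <xsl:template match="/article">
    <!-- Only the first matching anchor is used in this sketch. -->
    <xsl:variable name="anchor" select="para[starts-with(normalize-space(.),
        concat($extract-section, ' '))][1]"/>
    <section ref="{$extract-section}">
      <xsl:apply-templates
          select="my:section-body($anchor/following-sibling::*)"/>
    </section>
  </xsl:template>

  <xsl:template match="para">
    <text><xsl:value-of select="concat(., '&#10;')"/></text>
  </xsl:template>

  <xsl:template match="informaltable">
    <table><xsl:value-of select="concat(., '&#10;')"/></table>
  </xsl:template>

</xsl:stylesheet>
```

The nested if/then/else keeps the empty-sequence check in its own branch, so my:is-section-title is never called with an empty argument. The recursion itself is still there, so this mainly tidies the code and adds type information the processor can use.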

But your main performance penalty lies in passing along the following-sibling axis and walking it one node at a time. You can do the same trick with matching templates alone, and you are probably better off using keys to optimize performance, or introducing a for-each or a for-each-group. Anything is better than the recursive named template.
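
One possible shape for the for-each-group variant is sketched below. It assumes that every section title is a para matching the same regex as in the original stylesheet and that emitting all sections in one run, wrapped in a made-up sections element, is acceptable. The variable name title-regex is likewise an assumption of this sketch.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Same pattern as the original stylesheet; the trailing space is
       part of the pattern. -->
  <xsl:variable name="title-regex"
      select="'^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ '"/>

  <xsl:template match="/article">
    <sections>
      <!-- Each section title starts a new group; the elements that
           follow it, up to the next title, are the section body. -->
      <xsl:for-each-group select="para | informaltable"
          group-starting-with="para[matches(normalize-space(.), $title-regex)]">
        <xsl:variable name="title"
            select="normalize-space(current-group()[1])"/>
        <!-- Skip a possible leading group that has no title. -->
        <xsl:if test="matches($title, $title-regex)">
          <section ref="{substring-before($title, ' ')}"
                   name="{substring-after($title, ' ')}">
            <xsl:apply-templates
                select="current-group()[position() gt 1]"/>
          </section>
        </xsl:if>
      </xsl:for-each-group>
    </sections>
  </xsl:template>

  <xsl:template match="para">
    <text><xsl:value-of select="concat(., '&#10;')"/></text>
  </xsl:template>

  <xsl:template match="informaltable">
    <table><xsl:value-of select="concat(., '&#10;')"/></table>
  </xsl:template>

</xsl:stylesheet>
```

This visits each child of article once and needs no extract-section parameter, which also addresses the wish to run everything in one pass. Unlike the original, it silently skips children that are neither para nor informaltable instead of stopping at them.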

If that does not improve things, you should have a look at the backtracking problems your regular expression may cause. The regular expression engine used by Saxon is the same as the one from Java, and it performs quite badly when it comes to runaway backtracking on nested quantifiers (of the form (x+)+). I haven't looked into it enough, but if you can rewrite it for less backtracking, or optimize the regex to match the most common situation, or even pass it on to a doubly nested (awkward, I know, but hey, you are optimizing for speed) xsl:analyze-string, then you may gain a lot of speed.
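
To illustrate what "less backtracking" could mean for this particular pattern, the sketch below removes the overlapping alternatives. [1-9]+[0-9]? accepts the same strings as [1-9]+0?, and \.[1-9]?[0-9]+ accepts the same strings as \.[0-9]+, so the rewritten pattern should match exactly the same titles while leaving the engine fewer ways to split the digits. The function name my:starts-with-section-number and the titles/title output elements are invented for the example, and no timing has been measured.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:local"
    exclude-result-prefixes="xs my">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Original pattern, kept here only for comparison. -->
  <xsl:variable name="title-regex-original"
      select="'^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ '"/>

  <!-- Rewritten pattern: each digit can be consumed in only one way. -->
  <xsl:variable name="title-regex-tight"
      select="'^([A-F]|[1-9]+0?)(\.[0-9]+)+ '"/>

  <!-- Hypothetical helper wrapping the tighter pattern. -->
  <xsl:function name="my:starts-with-section-number" as="xs:boolean">
    <xsl:param name="text" as="xs:string"/>
    <xsl:sequence select="matches($text, $title-regex-tight)"/>
  </xsl:function>

  <!-- Lists every para that the tighter pattern treats as a title. -->
  <xsl:template match="/article">
    <titles>
      <xsl:for-each
          select="para[my:starts-with-section-number(normalize-space(.))]">
        <title><xsl:value-of select="normalize-space(.)"/></title>
      </xsl:for-each>
    </titles>
  </xsl:template>

</xsl:stylesheet>
```

Running both patterns over the same input and comparing the matched titles is a cheap way to confirm the rewrite before timing it.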

It is hard to predict the behavior of a regular expression. I once made a very simple regular expression for matching CSV records which took exponential time when the overall match for the CSV line failed (i.e., an unmatched quote). This regex took about 1.5 hours for a string of 60 characters (and the time doubled for each extra 3 characters; this regex is somewhere on the Saxon list)! Rewriting it for less backtracking improved the performance to linear.

If the regex is indeed the problem (test it with something straightforward), then I suggest you read the chapter on regex optimization in Jeffrey Friedl's now famous book on regular expressions.


HTH,
Cheers,
-- Abel Braaksma

PS: not all hints above will necessarily or predictably improve performance

PPS: you do not need the namespace for the XPath functions. After all, you use the fn: prefix for some functions and not for others. You can just leave it out.



Mathieu Malaterre wrote:

Hi there,

  I have a working version of an XSLT script:
http://gdcm.svn.sourceforge.net/viewvc/gdcm/Sandbox/xslt/2/

  See (*) and (**). What I would like to do is :

1. Be able to run the xslt in one pass. For now I have to run it with
<xsl:param name="extract-section" select="'C.1'"/>
then edit test.xsl file, comment the line and uncomment:
<!--xsl:param name="extract-section" select="'C.2'"/-->
and so on and so forth...

2. This script is seriously *slow*. I guess running it in one pass
should solve most of the issue, but if there was something obvious I
was missing... thanks !

-Mathieu

(*)
$ cat test.xml
<?xml version="1.0"?>
<article>
  <para>C.1 Title 1</para>
  <para>info for section C.1</para>
  <informaltable>table1</informaltable>
  <para>C.2 Title 2</para>
  <informaltable>table2</informaltable>
  <para>info for section C.2</para>
  <para>C.2.1 Title 2.1</para>
  <para>text for section C.2.1</para>
  <para>text for section C.2.1 again</para>
  <para>C.2.2 Tile 2.2</para>
  <informaltable>table for 2.2</informaltable>
  <para>text for section C.2.2</para>
</article>

(**)
$ cat test.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fn="http://www.w3.org/2005/xpath-functions" version="2.0">

<!-- GENERAL -->

<xsl:output method="xml" indent="yes" encoding="UTF-8"/>

<!-- number of the sample section to be extracted -->
<!--xsl:param name="extract-section" select="'C.1'"/-->
<!--xsl:param name="extract-section" select="'C.2'"/-->
<!--xsl:param name="extract-section" select="'C.2.1'"/-->
<xsl:param name="extract-section" select="'C.2.2'"/>


<xsl:template match="para">
<text>
<xsl:value-of select="concat(.,'&#10;')"/>
</text>
</xsl:template>

<xsl:template match="informaltable">
<table>
<xsl:value-of select="concat(.,'&#10;')"/>
</table>
</xsl:template>

<!-- MAIN -->

<xsl:template match="/article">
  <xsl:variable name="section-number" select="concat($extract-section,' ')"/>
  <xsl:variable name="section-anchor"
select="para[starts-with(normalize-space(.),$section-number)]"/>
  <xsl:variable name="section-name"
select="substring-after(para[starts-with(normalize-space(.),$section-number)],$extract-section)"/>
  <xsl:choose>
    <xsl:when test="count($section-anchor)=1">
      <xsl:message>Info: section <xsl:value-of
select="$extract-section"/> found</xsl:message>
      <xsl:element name="section">
        <xsl:attribute name="ref" select="$extract-section"/>
        <xsl:attribute name="name" select="normalize-space($section-name)"/>
        <xsl:call-template name="copy-section-paragraphs">
          <xsl:with-param name="section-paragraphs"
select="$section-anchor/following-sibling::*"/>
        </xsl:call-template>
      </xsl:element>
      <xsl:message>Info: all paragraphs extracted</xsl:message>
    </xsl:when>
    <xsl:when test="count($section-anchor)>1">
      <xsl:message>Error: section <xsl:value-of
select="$extract-section"/> found multiple times!</xsl:message>
    </xsl:when>
    <xsl:otherwise>
      <xsl:message>Error: section <xsl:value-of
select="$extract-section"/> not found!</xsl:message>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

<!-- TEMPLATES -->

<xsl:template name="copy-section-paragraphs">
  <xsl:param name="section-paragraphs"/>
  <xsl:variable name="current-paragraph" select="$section-paragraphs[1]"/>
  <!-- search for next section title -->
  <xsl:if test="($current-paragraph[name()='para' or name()='informaltable'])
                and not(fn:matches(normalize-space($current-paragraph),
                        '^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ '))">
    <!-- output current paragraph (close with a newline) -->
    <xsl:apply-templates select="$current-paragraph"/>
    <xsl:call-template name="copy-section-paragraphs">
      <xsl:with-param name="section-paragraphs"
select="$section-paragraphs[position()>1]"/>
    </xsl:call-template>
  </xsl:if>
</xsl:template>

</xsl:stylesheet>


