RE: [xsl] How to make this script faster


 From just looking at your stylesheet, I noticed a couple of 
things, but I don't know whether changes will make it faster.


I noticed a few stylistic things too. I hate the verbosity of

      <xsl:element name="section">
        <xsl:attribute name="ref" select="$extract-section"/>
        <xsl:attribute name="name"
                       select="normalize-space($section-name)"/>

when you could write

      <section ref="{$extract-section}"
               name="{normalize-space($section-name)}">

But that's not a performance issue, and nor are most of the points Abel
made; and I have to say I couldn't see anything at all here that should
cause performance problems.

Abel might be right about the regular expression - innocent-looking regexes
can sometimes catch you out - but this one looks as if it will give a
no-match on most input lines very quickly with no backtracking needed.

So, let's have some data:

* what processor/version are you using?

* how are you running it?

* what's the size of the input data?

* how long is it actually taking?

Michael Kay
http://www.saxonica.com/

You didn't specify what processor you use. If you use 
AltovaXML, it can at times be extremely slow (exponential 
performance) and it is worthwhile to try your code with a 
more optimized processor like Saxon.

 From the code I notice that you use XSLT 2.0, which can 
usually be more easily optimized than XSLT 1.0, both in code 
(tail recursion and using "as" attributes to specify result 
types) and in the processor, because the language allows for 
easier optimizations of common tasks (like regular 
expressions instead of recursive templates).

But you still seem to use a lot of XSLT 1.0 techniques where 
I would prefer the 2.0 version. Consider turning your named 
templates (the ones you invoke with xsl:call-template) into 
xsl:function definitions, even the recursive ones. Consider using 
the XPath expression "if (test) then ... else ..." instead of 
xsl:if or xsl:choose. Consider using matching templates instead 
of xsl:when etc., which may perform faster.
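
As a rough illustration of the xsl:function and if/then/else suggestions, here is a minimal sketch. The my: prefix, its namespace URI, and the function names my:is-heading and my:kind are hypothetical, and the heading pattern is the one from the posted stylesheet, assumed here to end in a single space. It only classifies the children of article. It does not do the extraction itself.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:my="http://example.org/local"
                exclude-result-prefixes="xs my"
                version="2.0">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Hypothetical helper: true if the element's text starts like a section heading -->
  <xsl:function name="my:is-heading" as="xs:boolean">
    <xsl:param name="node" as="element()"/>
    <xsl:sequence select="matches(normalize-space($node),
                                  '^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ ')"/>
  </xsl:function>

  <!-- Hypothetical helper: an XPath conditional in place of xsl:choose -->
  <xsl:function name="my:kind" as="xs:string">
    <xsl:param name="node" as="element()"/>
    <xsl:sequence select="if (my:is-heading($node)) then 'heading' else 'body'"/>
  </xsl:function>

  <xsl:template match="/article">
    <kinds>
      <xsl:for-each select="*">
        <item kind="{my:kind(.)}">
          <xsl:value-of select="normalize-space(.)"/>
        </item>
      </xsl:for-each>
    </kinds>
  </xsl:template>

</xsl:stylesheet>
```

The "as" attributes declare the parameter and result types, which is the kind of type information mentioned earlier as helping an XSLT 2.0 processor optimize.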

But your main performance penalty lies in passing along the 
following-sibling axis and walking it one node at a time. You 
can do the same trick with matching templates alone, and you 
are probably better off using keys to optimize performance, or 
introducing a for-each or a for-each-group. 
Anything is better than the recursive named template.
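
One possible shape for the for-each-group approach is sketched below. It emits every section in a single pass instead of selecting one through the extract-section parameter. The sections wrapper element and the heading-regex variable are assumptions of mine, and the heading pattern is the one from the posted stylesheet, assumed to end in a single space. Unlike the original, it does not stop at child elements other than para and informaltable; those would fall through to the built-in templates.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Heading pattern from the original stylesheet (trailing space assumed) -->
  <xsl:variable name="heading-regex"
                select="'^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ '"/>

  <xsl:template match="/article">
    <sections>
      <!-- A new group starts at every para that looks like a section heading -->
      <xsl:for-each-group select="*"
          group-starting-with="para[matches(normalize-space(.), $heading-regex)]">
        <xsl:variable name="heading"
                      select="normalize-space(current-group()[1])"/>
        <!-- Skip any leading group that does not begin with a heading -->
        <xsl:if test="current-group()[1][self::para]
                      and matches($heading, $heading-regex)">
          <section ref="{substring-before($heading, ' ')}"
                   name="{substring-after($heading, ' ')}">
            <xsl:apply-templates select="current-group()[position() gt 1]"/>
          </section>
        </xsl:if>
      </xsl:for-each-group>
    </sections>
  </xsl:template>

  <xsl:template match="para">
    <text><xsl:value-of select="."/></text>
  </xsl:template>

  <xsl:template match="informaltable">
    <table><xsl:value-of select="."/></table>
  </xsl:template>

</xsl:stylesheet>
```

Each element is visited once by the grouping, rather than the remaining sibling list being re-sliced on every recursive call.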

If that does not improve things, you should have a look at 
the backtracking your regular expression may cause. The 
regular expression engine used by Saxon is the same as the 
one from Java, and it performs quite badly when it hits 
catastrophic backtracking on nested quantifiers (of the 
form (x+)+). I haven't looked into it enough, but if you can 
rewrite it for less backtracking, or optimize the regex to 
match the most common situation, or even pass it on in a 
doubly nested (awkward, I know, but hey, you are optimizing 
for speed) xsl:analyze-string, then you may gain a lot of speed.

It is hard to predict the behavior of a regular expression. I 
once wrote a very simple regular expression for matching CSV 
records which took exponential time when the overall 
match for the CSV line failed (i.e., an unmatched quote 
pair). This regex took about 1.5 hours for a string of 60 
characters, and the time doubled for every 3 extra characters 
(the regex is somewhere on the Saxon list)! Rewriting it for 
less backtracking improved the performance to linear.
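
To make the nested-quantifier problem concrete, here is a small sketch with two patterns that accept exactly the same strings. The input parameter and its default value are hypothetical, and whether the first pattern actually blows up depends on the regex engine behind your processor. The default input is kept short on purpose. If you lengthen the run of x characters (still without a trailing y), a backtracking engine may need roughly twice as long for each extra character on the nested form.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output method="text"/>

  <!-- Hypothetical test input: a run of x with no trailing y, so both matches fail -->
  <xsl:param name="input" select="'xxxx'"/>

  <xsl:template match="/">
    <!-- Nested quantifier: many ways to split the run of x between the
         inner and outer +, all of which may be tried before giving up -->
    <xsl:value-of select="matches($input, '^(x+)+y$')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Equivalent pattern with a single quantifier: one way to consume the run -->
    <xsl:value-of select="matches($input, '^x+y$')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```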

If the regex is indeed the problem (test it with something 
straightforward), then I suggest you read the chapter on regex 
optimization in Jeffrey Friedl's now famous book on regular expressions.

HTH,
Cheers,
-- Abel Braaksma

PS: not all hints above will necessarily or predictably 
improve performance
PPS: you do not need the namespace for the XPath functions, 
after all, for some functions you do use the fn: prefix, for 
others you don't... 
You can just leave it out.


Mathieu Malaterre wrote:

Hi there,

  I have a working version of an XSLT script:
http://gdcm.svn.sourceforge.net/viewvc/gdcm/Sandbox/xslt/2/

  See (*) and (**). What I would like to do is :

1. Be able to run the xslt in one pass. For now I have to run it with
<xsl:param name="extract-section" select="'C.1'"/>
then edit test.xsl file, comment the line and uncomment:
<!--xsl:param name="extract-section" select="'C.2'"/-->
and so on and so forth...

2. This script is seriously *slow*. I guess running it in one pass
should solve most of the issue, but if there was something obvious I
was missing... thanks !

-Mathieu

(*)
$ cat test.xml
<?xml version="1.0"?>
<article>
  <para>C.1 Title 1</para>
  <para>info for section C.1</para>
  <informaltable>table1</informaltable>
  <para>C.2 Title 2</para>
  <informaltable>table2</informaltable>
  <para>info for section C.2</para>
  <para>C.2.1 Title 2.1</para>
  <para>text for section C.2.1</para>
  <para>text for section C.2.1 again</para>
  <para>C.2.2 Tile 2.2</para>
  <informaltable>table for 2.2</informaltable>
  <para>text for section C.2.2</para>
</article>

(**)
$ cat test.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fn="http://www.w3.org/2005/xpath-functions" version="2.0">

<!-- GENERAL -->

<xsl:output method="xml" indent="yes" encoding="UTF-8"/>

<!-- number of the sample section to be extracted -->
<!--xsl:param name="extract-section" select="'C.1'"/-->
<!--xsl:param name="extract-section" select="'C.2'"/-->
<!--xsl:param name="extract-section" select="'C.2.1'"/-->
<xsl:param name="extract-section" select="'C.2.2'"/>


<xsl:template match="para">
<text>
<xsl:value-of select="concat(.,'&#10;')"/>
</text>
</xsl:template>

<xsl:template match="informaltable">
<table>
<xsl:value-of select="concat(.,'&#10;')"/>
</table>
</xsl:template>

<!-- MAIN -->

<xsl:template match="/article">
  <xsl:variable name="section-number"
                select="concat($extract-section,' ')"/>
  <xsl:variable name="section-anchor"
                select="para[starts-with(normalize-space(.),$section-number)]"/>
  <xsl:variable name="section-name"
                select="substring-after(para[starts-with(normalize-space(.),$section-number)],$extract-section)"/>

  <xsl:choose>
    <xsl:when test="count($section-anchor)=1">
      <xsl:message>Info: section <xsl:value-of
select="$extract-section"/> found</xsl:message>
      <xsl:element name="section">
        <xsl:attribute name="ref" select="$extract-section"/>
        <xsl:attribute name="name"
                       select="normalize-space($section-name)"/>

        <xsl:call-template name="copy-section-paragraphs">
          <xsl:with-param name="section-paragraphs"
select="$section-anchor/following-sibling::*"/>
        </xsl:call-template>
      </xsl:element>
      <xsl:message>Info: all paragraphs extracted</xsl:message>
    </xsl:when>
    <xsl:when test="count($section-anchor)>1">
      <xsl:message>Error: section <xsl:value-of
select="$extract-section"/> found multiple times!</xsl:message>
    </xsl:when>
    <xsl:otherwise>
      <xsl:message>Error: section <xsl:value-of
select="$extract-section"/> not found!</xsl:message>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

<!-- TEMPLATES -->

<xsl:template name="copy-section-paragraphs">
  <xsl:param name="section-paragraphs"/>
  <xsl:variable name="current-paragraph"
                select="$section-paragraphs[1]"/>

  <!-- search for next section title -->
  <xsl:if test="($current-paragraph[name()='para' or name()='informaltable']) and
                not(fn:matches(normalize-space($current-paragraph),'^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ '))">
    <!-- output current paragraph (close with a newline) -->
    <xsl:apply-templates select="$current-paragraph"/>
    <xsl:call-template name="copy-section-paragraphs">
      <xsl:with-param name="section-paragraphs"
select="$section-paragraphs[position()>1]"/>
    </xsl:call-template>
  </xsl:if>
</xsl:template>

</xsl:stylesheet>



--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


