Re: [xsl] How to make this script faster
2007-11-15 15:08:26
Hi Mathieu,
From just looking at your stylesheet, I noticed a couple of things, but
I don't know whether changing them will make it faster. You didn't say
which processor you use. If you use AltovaXML, it can at times be
extremely slow (exponential performance), and it is worthwhile to try
your code with a more optimized processor like Saxon.
From the code I notice that you use XSLT 2.0, which can usually be more
easily optimized than XSLT 1.0, both in code (tail recursion and using
"as" attributes to specify result types) and in the processor, because
the language allows for easier optimizations of common tasks (like
regular expressions instead of recursive templates).
But you still seem to use a lot of XSLT 1.0 techniques where I would
prefer the 2.0 version. Consider turning your named templates
(xsl:call-template) into xsl:function definitions (these can be
recursive too). Consider using the XPath expression
if (value) then ... else ... instead of xsl:if or xsl:choose. Consider
using matching templates instead of xsl:when etc., which may perform
faster.
But your main performance penalty lies in passing the whole
following-sibling axis along as a parameter and walking it one node at
a time. You can do the same trick with matching templates alone, and you
are probably better off using keys, or introducing a for-each or a
for-each-group. Anything is better than the recursive named template.
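A sketch of the for-each-group variant, assuming every section title is
a para matching the regex from your stylesheet and that all sections
should be written in a single run. The sections wrapper element and the
title-regex variable name are my own invention, and I have not timed it
against your recursive version:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Same regex as in the original stylesheet, kept in one place. -->
  <xsl:variable name="title-regex"
      select="'^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ '"/>

  <xsl:template match="para">
    <text>
      <xsl:value-of select="concat(., ' ')"/>
    </text>
  </xsl:template>

  <xsl:template match="informaltable">
    <table>
      <xsl:value-of select="concat(., ' ')"/>
    </table>
  </xsl:template>

  <xsl:template match="/article">
    <sections>
      <!-- Each title paragraph starts a new group; the group runs up
           to (not including) the next title paragraph. -->
      <xsl:for-each-group select="para | informaltable"
          group-starting-with="para[matches(normalize-space(.), $title-regex)]">
        <!-- The context item is the first item of the group. Skip a
             possible leading group that has no title. -->
        <xsl:if test="self::para and matches(normalize-space(.), $title-regex)">
          <xsl:variable name="head" select="normalize-space(.)"/>
          <section ref="{substring-before($head, ' ')}"
                   name="{substring-after($head, ' ')}">
            <xsl:apply-templates select="current-group()[position() gt 1]"/>
          </section>
        </xsl:if>
      </xsl:for-each-group>
    </sections>
  </xsl:template>

</xsl:stylesheet>
```

This visits each child of article once, and it also removes the need to
edit the extract-section parameter and rerun the stylesheet for every
section.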
If that does not improve things, you should have a look at the
backtracking your regular expression may cause. The regular expression
engine used by Saxon is the same as the one from Java, and it performs
quite badly when it comes to exponential backtracking on nested
quantifiers (of the form: (x+)+). I haven't looked into it enough, but
if you can rewrite it for less backtracking, or optimize the regex to
match the most common situation first, or even apply it in a doubly
nested xsl:analyze-string (awkward, I know, but hey, you are optimizing
for speed), then you may gain a lot of speed.
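As a sketch of what "less backtracking" could look like for the regex
in your stylesheet: [1-9]?[0-9]+ accepts exactly the same strings as
[0-9]+, and [1-9]+[0-9]? accepts exactly the same strings as [1-9]+0?,
but the shorter forms leave the engine only one way to match each
digit. Whether this makes a measurable difference here is an untested
assumption; the my: namespace and the names below are invented for the
example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:local"
    exclude-result-prefixes="xs my">

  <!-- Original: '^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ '
       Rewritten so that no digit can be claimed by two adjacent
       quantifiers; the set of matching strings is unchanged. -->
  <xsl:variable name="title-regex" as="xs:string"
      select="'^([A-F]|[1-9]+0?)(\.[0-9]+)+ '"/>

  <!-- Hypothetical helper: true if the node's normalized text starts
       with a section number followed by a space. -->
  <xsl:function name="my:is-section-title" as="xs:boolean">
    <xsl:param name="node" as="node()"/>
    <xsl:sequence select="matches(normalize-space($node), $title-regex)"/>
  </xsl:function>

</xsl:stylesheet>
```

A test such as not(my:is-section-title($current-paragraph)) could then
replace the inline fn:matches call.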
It is hard to predict the behavior of a regular expression. I once made
a very simple regular expression for matching CSV records that showed
exponential running time when the overall match for the CSV line failed
(i.e., on an unmatched quote). This regex took about 1.5 hours for a
string of 60 characters, and the time doubled for each 3 extra
characters (this regex is somewhere on the Saxon list)! Rewriting it
for less backtracking made the performance linear.
If the regex is indeed the problem (test it with something
straightforward), then I suggest you read the regex optimizing chapter
in Jeffrey Friedl's now famous book on regular expressions.
HTH,
Cheers,
-- Abel Braaksma
PS: not all hints above will necessarily or predictably improve
performance.
PPS: you do not need the namespace declaration for the XPath functions.
You use the fn: prefix for some functions and not for others; you can
just leave it out everywhere.
Mathieu Malaterre wrote:
Hi there,
I have a working version of an XSLT script:
http://gdcm.svn.sourceforge.net/viewvc/gdcm/Sandbox/xslt/2/
See (*) and (**). What I would like to do is :
1. Be able to run the xslt in one pass. For now I have to run it with
<xsl:param name="extract-section" select="'C.1'"/>
then edit test.xsl file, comment the line and uncomment:
<xsl:param name="extract-section" select="'C.2'"/>
and so on and so forth...
2. This script is seriously *slow*. I guess running it in one pass
should solve most of the issue, but if there was something obvious I
was missing... thanks !
-Mathieu
(*)
$ cat test.xml
<?xml version="1.0"?>
<article>
<para>C.1 Title 1</para>
<para>info for section C.1</para>
<informaltable>table1</informaltable>
<para>C.2 Title 2</para>
<informaltable>table2</informaltable>
<para>info for section C.2</para>
<para>C.2.1 Title 2.1</para>
<para>text for section C.2.1</para>
<para>text for section C.2.1 again</para>
<para>C.2.2 Tile 2.2</para>
<informaltable>table for 2.2</informaltable>
<para>text for section C.2.2</para>
</article>
(**)
$ cat test.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fn="http://www.w3.org/2005/xpath-functions" version="2.0">
<!-- GENERAL -->
<xsl:output method="xml" indent="yes" encoding="UTF-8"/>
<!-- number of the sample section to be extracted -->
<!--xsl:param name="extract-section" select="'C.1'"/-->
<!--xsl:param name="extract-section" select="'C.2'"/-->
<!--xsl:param name="extract-section" select="'C.2.1'"/-->
<xsl:param name="extract-section" select="'C.2.2'"/>
<xsl:template match="para">
<text>
<xsl:value-of select="concat(.,' ')"/>
</text>
</xsl:template>
<xsl:template match="informaltable">
<table>
<xsl:value-of select="concat(.,' ')"/>
</table>
</xsl:template>
<!-- MAIN -->
<xsl:template match="/article">
<xsl:variable name="section-number" select="concat($extract-section,' ')"/>
<xsl:variable name="section-anchor"
select="para[starts-with(normalize-space(.),$section-number)]"/>
<xsl:variable name="section-name"
select="substring-after(para[starts-with(normalize-space(.),$section-number)],$extract-section)"/>
<xsl:choose>
<xsl:when test="count($section-anchor)=1">
<xsl:message>Info: section <xsl:value-of
select="$extract-section"/> found</xsl:message>
<xsl:element name="section">
<xsl:attribute name="ref" select="$extract-section"/>
<xsl:attribute name="name" select="normalize-space($section-name)"/>
<xsl:call-template name="copy-section-paragraphs">
<xsl:with-param name="section-paragraphs"
select="$section-anchor/following-sibling::*"/>
</xsl:call-template>
</xsl:element>
<xsl:message>Info: all paragraphs extracted</xsl:message>
</xsl:when>
<xsl:when test="count($section-anchor)>1">
<xsl:message>Error: section <xsl:value-of
select="$extract-section"/> found multiple times!</xsl:message>
</xsl:when>
<xsl:otherwise>
<xsl:message>Error: section <xsl:value-of
select="$extract-section"/> not found!</xsl:message>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<!-- TEMPLATES -->
<xsl:template name="copy-section-paragraphs">
<xsl:param name="section-paragraphs"/>
<xsl:variable name="current-paragraph" select="$section-paragraphs[1]"/>
<!-- search for next section title -->
<xsl:if test="($current-paragraph[name()='para' or
name()='informaltable']) and
not(fn:matches(normalize-space($current-paragraph),'^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ '))">
<!-- output current paragraph (close with a newline) -->
<xsl:apply-templates select="$current-paragraph"/>
<xsl:call-template name="copy-section-paragraphs">
<xsl:with-param name="section-paragraphs"
select="$section-paragraphs[position()>1]"/>
</xsl:call-template>
</xsl:if>
</xsl:template>
</xsl:stylesheet>