From just looking at your stylesheet, I noticed a couple of
things, but I don't know whether changes will make it faster.
I noticed a few stylistic things too. I hate the verbosity of
<xsl:element name="section">
<xsl:attribute name="ref" select="$extract-section"/>
<xsl:attribute name="name"
select="normalize-space($section-name)"/>
when you could write
<section ref="{$extract-section}"
name="{normalize-space($section-name)}">
But that's not a performance issue, and nor are most of the points Abel
made; and I have to say I couldn't see anything at all here that should
cause performance problems.
Abel might be right about the regular expression - innocent-looking regexes
can sometimes catch you out - but this one looks as if it will give a
no-match on most input lines very quickly with no backtracking needed.
So, let's have some data:
* what processor/version are you using?
* how are you running it?
* what's the size of the input data?
* how long is it actually taking?
Michael Kay
http://www.saxonica.com/
You didn't specify what processor you use. If you use
AltovaXML, it can at times be extremely slow (exponential
performance) and it is worthwhile to try your code with a
more optimized processor like Saxon.
From the code I notice that you use XSLT 2.0, which can
usually be more easily optimized than XSLT 1.0, both in code
(tail recursion and using "as" attributes to specify result
types) and in the processor, because the language allows for
easier optimizations of common tasks (like regular
expressions instead of recursive templates).
But you still seem to use a lot of XSLT 1.0 techniques where
I would prefer the 2.0 version. Consider putting your
xsl:call-template code (named templates) in an xsl:function
(even recursively). Consider using
if (value) then ... else ... instead of xsl:if or xsl:choose.
Consider using matching templates instead of xsl:when etc.,
which may perform faster.
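
To make those two suggestions concrete, here is a minimal sketch of an xsl:function combined with an XPath 2.0 if/then/else. The function name my:is-section-title, its namespace URI, and the <report>/<item> output elements are made up for illustration; only the regular expression is taken from test.xsl. It is not meant as a drop-in replacement, and whether it runs any faster depends on the processor.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:my="http://example.com/ns/local"
                exclude-result-prefixes="xs my"
                version="2.0">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Hypothetical helper: true if the element is a para that starts
       with a section number followed by a space -->
  <xsl:function name="my:is-section-title" as="xs:boolean">
    <xsl:param name="e" as="element()"/>
    <xsl:sequence select="exists($e/self::para) and
        matches(normalize-space($e),
                '^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ ')"/>
  </xsl:function>

  <xsl:template match="/article">
    <report>
      <xsl:for-each select="*">
        <!-- if/then/else inside an attribute value template
             replaces an xsl:choose around xsl:attribute -->
        <item kind="{if (my:is-section-title(.)) then 'title' else 'content'}">
          <xsl:value-of select="normalize-space(.)"/>
        </item>
      </xsl:for-each>
    </report>
  </xsl:template>

</xsl:stylesheet>
```

The "as" attributes on the function and its parameter are the kind of type declaration mentioned earlier; they also let the processor report a type error early if the function is called with the wrong kind of argument.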
But your main performance penalty probably lies in passing
the following-sibling axis along and walking it one node at
a time. You can do the same trick with matching templates
alone, and you are probably better off using keys to optimize
performance, or introducing an xsl:for-each or an
xsl:for-each-group. Anything is better than the recursive
named template.
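
As a rough sketch of the xsl:for-each-group idea (untested against the real data, so treat it as an assumption rather than a fix): group the children of <article> with group-starting-with, so that every section title starts a new group. This visits each node once and produces all sections in a single run, without the extract-section parameter. The <sections> wrapper element and the title-regex variable name are made up for illustration; the regex and the <section>, <text> and <table> output elements come from test.xsl.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Same pattern as in test.xsl: a section number followed by a space -->
  <xsl:variable name="title-regex" as="xs:string"
                select="'^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ '"/>

  <xsl:template match="/article">
    <sections>
      <!-- Every para that looks like a section title starts a new group -->
      <xsl:for-each-group select="*"
          group-starting-with="para[matches(normalize-space(.), $title-regex)]">
        <xsl:variable name="heading"
                      select="normalize-space(current-group()[1])"/>
        <!-- Skip a possible leading group that has no title -->
        <xsl:if test="current-group()[1]/self::para
                      and matches($heading, $title-regex)">
          <section ref="{substring-before($heading, ' ')}"
                   name="{substring-after($heading, ' ')}">
            <xsl:apply-templates select="current-group()[position() gt 1]"/>
          </section>
        </xsl:if>
      </xsl:for-each-group>
    </sections>
  </xsl:template>

  <xsl:template match="para">
    <text><xsl:value-of select="normalize-space(.)"/></text>
  </xsl:template>

  <xsl:template match="informaltable">
    <table><xsl:value-of select="normalize-space(.)"/></table>
  </xsl:template>

</xsl:stylesheet>
```

Like the original, this treats sections as flat: C.2 ends where C.2.1 begins. Unlike the original, it does not stop at the first element that is neither para nor informaltable, so that check would have to be added if the real input needs it.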
If that does not improve things, you should have a look at
some of the backtracking problems your regular expression
may cause. The regular expression engine used by Saxon is
the same as the one from Java and it performs quite badly
when it comes to exponential backtracking (of the
form: (x+)+). I haven't looked into it enough, but if you can
rewrite it for less backtracking, or optimize the regex to
match the most common situation, or even pass it on in a
doubly nested (awkward, I know, but hey you are optimizing
for speed) xsl:analyze-string, then you may gain a lot of speed.
It is hard to predict the behavior of a regular expression. I
once made a very simple regular expression for matching CSV
records which showed exponential behavior when the overall
match for the CSV line failed (i.e., an unmatched quote).
This regex took about 1.5 hours for a string of 60
characters (and the time doubled for each extra 3 characters;
this regex is somewhere on the Saxon list)! Rewriting it for less
backtracking improved the performance to linear.
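
To illustrate the kind of rewrite meant here (this is not the CSV regex mentioned above, just the textbook (x+)+ shape): the nested pattern ^(a+)+$ accepts exactly the same strings as the flat ^a+$, but on a failing input a backtracking engine may try every way of splitting the run of a's before giving up. The sketch below only checks that the two patterns agree on a few short, made-up strings; the sample values and the <regex-check> and <sample> element names are assumptions for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical test strings, kept short so that the nested form
       stays fast even in a backtracking engine -->
  <xsl:variable name="samples" as="xs:string*"
                select="('aaaa', 'aaaaaaaaaab', '')"/>

  <!-- Runs against any input document; the input itself is ignored -->
  <xsl:template match="/">
    <regex-check>
      <xsl:for-each select="$samples">
        <!-- nested: quantifier inside a quantified group;
             flat: same language, a single quantifier -->
        <sample value="{.}"
                nested="{matches(., '^(a+)+$')}"
                flat="{matches(., '^a+$')}"/>
      </xsl:for-each>
    </regex-check>
  </xsl:template>

</xsl:stylesheet>
```

The nested and flat attributes should agree for every sample. Lengthening the failing string (the one ending in b) is a simple way to see whether a given processor's regex engine slows down sharply on the nested form.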
If the regex is indeed the problem (test it with something
straightforward) then I suggest you read the regex optimizing
chapter in Jeffrey Friedl's now famous book on regular expressions.
HTH,
Cheers,
-- Abel Braaksma
PS: not all hints above will necessarily or predictably
improve performance
PPS: you do not need the namespace for the XPath functions,
after all, for some functions you do use the fn: prefix, for
others you don't...
You can just leave it out.
Mathieu Malaterre wrote:
Hi there,
I have a working version of an XSLT script:
http://gdcm.svn.sourceforge.net/viewvc/gdcm/Sandbox/xslt/2/
See (*) and (**). What I would like to do is :
1. Be able to run the xslt in one pass. For now I have to
run it with
<xsl:param name="extract-section" select="'C.1'"/>
then edit test.xsl file, comment the line and uncomment:
<!--xsl:param name="extract-section" select="'C.2'"/-->
and so on and so forth...
2. This script is seriously *slow*. I guess running it in one pass
should solve most of the issue, but if there was something obvious I
was missing... thanks !
-Mathieu
(*)
$ cat test.xml
<?xml version="1.0"?>
<article>
<para>C.1 Title 1</para>
<para>info for section C.1</para>
<informaltable>table1</informaltable>
<para>C.2 Title 2</para>
<informaltable>table2</informaltable>
<para>info for section C.2</para>
<para>C.2.1 Title 2.1</para>
<para>text for section C.2.1</para>
<para>text for section C.2.1 again</para>
<para>C.2.2 Tile 2.2</para>
<informaltable>table for 2.2</informaltable>
<para>text for section C.2.2</para>
</article>
(**)
$ cat test.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fn="http://www.w3.org/2005/xpath-functions" version="2.0">
<!-- GENERAL -->
<xsl:output method="xml" indent="yes" encoding="UTF-8"/>
<!-- number of the sample section to be extracted -->
<!--xsl:param name="extract-section" select="'C.1'"/-->
<!--xsl:param name="extract-section" select="'C.2'"/-->
<!--xsl:param name="extract-section" select="'C.2.1'"/-->
<xsl:param name="extract-section" select="'C.2.2'"/>
<xsl:template match="para">
<text>
<xsl:value-of select="concat(.,' ')"/>
</text>
</xsl:template>
<xsl:template match="informaltable">
<table>
<xsl:value-of select="concat(.,' ')"/>
</table>
</xsl:template>
<!-- MAIN -->
<xsl:template match="/article">
<xsl:variable name="section-number"
select="concat($extract-section,' ')"/>
<xsl:variable name="section-anchor"
select="para[starts-with(normalize-space(.),$section-number)]"/>
<xsl:variable name="section-name"
select="substring-after(para[starts-with(normalize-space(.),$section-number)],$extract-section)"/>
<xsl:choose>
<xsl:when test="count($section-anchor)=1">
<xsl:message>Info: section <xsl:value-of
select="$extract-section"/> found</xsl:message>
<xsl:element name="section">
<xsl:attribute name="ref" select="$extract-section"/>
<xsl:attribute name="name"
select="normalize-space($section-name)"/>
<xsl:call-template name="copy-section-paragraphs">
<xsl:with-param name="section-paragraphs"
select="$section-anchor/following-sibling::*"/>
</xsl:call-template>
</xsl:element>
<xsl:message>Info: all paragraphs extracted</xsl:message>
</xsl:when>
<xsl:when test="count($section-anchor)>1">
<xsl:message>Error: section <xsl:value-of
select="$extract-section"/> found multiple times!</xsl:message>
</xsl:when>
<xsl:otherwise>
<xsl:message>Error: section <xsl:value-of
select="$extract-section"/> not found!</xsl:message>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<!-- TEMPLATES -->
<xsl:template name="copy-section-paragraphs">
<xsl:param name="section-paragraphs"/>
<xsl:variable name="current-paragraph"
select="$section-paragraphs[1]"/>
<!-- search for next section title -->
<xsl:if test="($current-paragraph[name()='para' or
name()='informaltable']) and
not(fn:matches(normalize-space($current-paragraph),'^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ '))">
<!-- output current paragraph (close with a newline) -->
<xsl:apply-templates select="$current-paragraph"/>
<xsl:call-template name="copy-section-paragraphs">
<xsl:with-param name="section-paragraphs"
select="$section-paragraphs[position()>1]"/>
</xsl:call-template>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail:
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--