Hi Xslt'ers,
I happen to run into an XSLT 2 behavior that I am uncertain of if it is
desired or expected behavior of implementations, if it is undesired, but
still expected, or if it should be prohibited.
Let me explain. Backtracking is a way that most regular expression
engines work with, where the engine tries to do a partial match on a
largest possible string and then "tracks back" to make the whole string
matching. But let me skip the details of this process...
Suppose you have a regular expression that does not match the target
string. With backtracking, unmatching strings take the longest
processing time, which is normal and desirable for several reasons. What
happens is that the engine tracks back some positions on the string and
tries to apply the regex again. The number of positions to step back is
roughly equivalent to the smallest possible match.
Now, suppose this smallest possible match is zero length. In this
situation, with classic regex parsers, the engine will try forever on a
non-matching string (or at least very long). In XSLT, this behavior
seems to happen even when the "smallest possible match" is of non-zero
length.
In XSLT this is shown as follows, and I use the example of matching a
CSV quoted string which may escape the quote by doubling it:
Good examples:
"well quoted csv string"
"well ""quoted"" csv string
Bad example:
not well "quoted csv string
The regular expression to match a quoted CSV string is
"([^"]+|"")*"
This gives trouble in the following XSLT (showing a non-matching
string), which runs for a very long time (exponential performance chart)
<xsl:variable name="ltr">
before " after and some more text after
</xsl:variable>
<xsl:variable name="regex">"([^"]+|"")*"</xsl:variable>
<xsl:analyze-string select="$ltr" regex="{$regex}">
<xsl:matching-substring>
<found><xsl:value-of select="." /></found>
</xsl:matching-substring>
<xsl:non-matching-substring>
<not-found><xsl:value-of select="." /></not-found>
</xsl:non-matching-substring>
</xsl:analyze-string>
The above example ran for > 10 minutes before I cancelled it (it did not
give any problem with matching strings, but did give problems with
strings that have a large non-matching part with a quote in it). A
simpler regex (but less useful) that shows the same behavior: "([^"]+)*"
I am under the impression that such behavior is not desirable, but I am
unsure if there is anything in the specs that says something about how
implementations should/must deal with this. As a comparison, I tried the
example with Perl, which gave no noticeable performance troubles.
I would like to add as a side note the danger of repeating empty
matching regular expressions. If I were to write the above regular
expression as follows:
([^"]*|("")*)*
then the subexpression [^"]*|("")* can match an empty string. Repeating
that indefinitely (* quantifier), causes regex engines to lock up (but
with Perl: no problem). This is a known problem with regexes and with
careful regex crafting it need not happen.
However, in the example above, I use a subexpression that never matches
an empty string and as such should not happen to fall into the
eternal-loop problem.
I use Saxon 8.8 most of the time, with Java 1.5. I tried with Altova
2006, but it does not allow xsl:analyze-string...
Note that it may or may not be possible to optimize the regex in such a
way that it fails earlier on the given string. But my problem is with
the (imo) unpredictability of the performance with any
less-then-very-trivial regex.
Cheers,
-- Abel Braaksma
http://www.nuntia.nl
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--