[xsl] Backtracking and eternal loops caused by regular expressions matching: what to expect from implementations?

Hi Xslt'ers,

I have run into an XSLT 2.0 behavior and I am unsure whether it is desired or expected of implementations, undesired but still expected, or something that should be prohibited.

Let me explain. Backtracking is how most regular expression engines work: the engine first tries a partial match on the largest possible string and then "tracks back" until the whole string matches. But let me skip the details of this process...

Suppose you have a regular expression that does not match the target string. With backtracking, non-matching strings take the longest to process, which is normal and desirable for several reasons. What happens is that the engine tracks back some positions in the string and tries to apply the regex again. The number of positions to step back is roughly equivalent to the smallest possible match.

Now, suppose this smallest possible match is zero length. In this situation, with classic regex parsers, the engine will try forever on a non-matching string (or at least for a very long time). In XSLT, this behavior seems to happen even when the "smallest possible match" is of non-zero length.

In XSLT this shows up as follows. I use the example of matching a CSV quoted string, which may escape a quote by doubling it:

Good examples:
"well quoted csv string"
"well ""quoted"" csv string"

Bad example:
not well "quoted csv string

The regular expression to match a quoted CSV string is
"([^"]+|"")*"

This causes trouble in the following XSLT (shown with a non-matching string), which runs for a very long time (the running time grows exponentially):


<xsl:variable name="ltr">
   before " after and some more text after
</xsl:variable>
<xsl:variable name="regex">"([^"]+|"")*"</xsl:variable>

<xsl:analyze-string select="$ltr" regex="{$regex}">
   <xsl:matching-substring>
      <found><xsl:value-of select="." /></found>
   </xsl:matching-substring>
   <xsl:non-matching-substring>
       <not-found><xsl:value-of select="." /></not-found>
   </xsl:non-matching-substring>
</xsl:analyze-string>

The above example ran for more than 10 minutes before I cancelled it. It gave no problems with matching strings, but it did with strings that have a large non-matching part with a quote in it. A simpler (but less useful) regex that shows the same behavior: "([^"]+)*"
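
A rough count may help explain the running time. This is only a sketch under assumptions: it supposes a naive backtracking matcher without memoization (whether the engine underneath Saxon behaves exactly this way is an assumption), and it writes n for the number of characters following the unmatched quote and c(k) for the number of ways to split k characters into consecutive non-empty runs of [^"]+. Every such split is a separate path that ends in a failed attempt to match the closing quote:

```latex
\underbrace{\sum_{k=0}^{n} c(k)}_{\text{paths tried}}
  \;=\; 1 + \sum_{k=1}^{n} 2^{k-1}
  \;=\; 2^{n},
\qquad
n \approx 30 \;\Rightarrow\; 2^{30} \approx 1.07 \times 10^{9}
```

Under these assumptions each extra character after the quote doubles the work, which would be consistent with the exponential growth observed for the roughly 30 characters that follow the quote in the example.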

I am under the impression that such behavior is not desirable, but I am unsure if there is anything in the specs that says something about how implementations should/must deal with this. As a comparison, I tried the example with Perl, which gave no noticeable performance troubles.

As a side note, I would like to point out the danger of repeating regular expressions that can match the empty string. If I were to write the above regular expression as follows:


([^"]*|("")*)*

then the subexpression [^"]*|("")* can match an empty string. Repeating that indefinitely (* quantifier) causes regex engines to lock up (but with Perl: no problem). This is a known problem with regexes, and with careful regex crafting it need not happen.

However, in the example above I use a subexpression that never matches an empty string, so it should not fall into the eternal-loop problem.

I use Saxon 8.8 most of the time, with Java 1.5. I tried with Altova 2006, but it does not allow xsl:analyze-string...

Note that it may or may not be possible to optimize the regex in such a way that it fails earlier on the given string. But my problem is with the (imo) unpredictability of the performance with any less-than-very-trivial regex.
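
For this particular regex, one such rewrite might look like the sketch below. It drops the + inside the starred group, so each position can be consumed in only one way (either a single non-quote character or a doubled quote), while the set of strings matched stays the same. The variable name regex-linear and the template name main are hypothetical, and I have not timed this on Saxon 8.8, so treat it as an illustration rather than a verified fix:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:output method="xml" indent="yes"/>

   <!-- Same non-matching input as in the example above -->
   <xsl:variable name="ltr">
   before " after and some more text after
</xsl:variable>

   <!-- Assumption: equivalent to "([^"]+|"")*" but with no quantifier
        nested inside the starred group, so a failed match should back
        out in a number of steps proportional to the input length -->
   <xsl:variable name="regex-linear">"([^"]|"")*"</xsl:variable>

   <!-- Hypothetical entry point: run with "main" as the initial template -->
   <xsl:template name="main">
      <result>
         <xsl:analyze-string select="$ltr" regex="{$regex-linear}">
            <xsl:matching-substring>
               <found><xsl:value-of select="." /></found>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
               <not-found><xsl:value-of select="." /></not-found>
            </xsl:non-matching-substring>
         </xsl:analyze-string>
      </result>
   </xsl:template>

</xsl:stylesheet>
```

Since the input contains only one quote, this should report the whole string as a single non-matching substring. It sidesteps the problem for this one pattern only; it does not answer the general question of what implementations ought to guarantee.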




Cheers,
-- Abel Braaksma
  http://www.nuntia.nl


