xsl-list
[Top] [All Lists]

Re: [xsl] Backtracking and eternal loops caused by regular expressions matching: what to expect from implementations?

2007-01-24 03:45:12
Michael Kay wrote:
(b) The specs have nothing to say about performance.

I understand. Sounds reasonable and logical.

(c) Saxon relies entirely on the regex engines in the underlying platform
(Java or .NET). I dare say there are cases where one regex engine will find
an optimization that another one misses.

I cannot speak for .NET, but in the thread on the Saxon list I go deeper into this. It's not just about a spurious optimization feat or mismatch, it is about quite a common case.

(d) I was under the impression that regular expression evaluation will
always terminate, though of course it's possible to construct cases where
that might require extreme patience to observe.

In technical terms, you are right, however, not if you include the definition the astronomers have given about time: it ends some day. Jeffrey Friedl gives a calculation on page 227 of his 2nd ed book that a 50 char string with a bad regular expression, may run for a years. Here's the same with Java:

25 char:  2.5 seconds
26 char: 5.3 seconds
27 char: 10.7 seconds
28 char: 21.4 seconds
35 char: 2739 seconds
40 char: 24.4 hours
50 char: almost 3 years
85 char: 97 * 10^9 years (97 billion years)

Meaning that if I apply such a regex to a corpus of over 85 characters, the process will stop long after the the whole universe has died out (current thesis: it will last about 30 billion years at most). Which is pretty long ;)

Calculation of this performance penalty: 2^(stringlen) * (time-one-char-match)

Java (Sun) is just a little behind with adding this optimization, because really, there is no need in testing any string more than string-len times, usually much less even.

-- Abel Braaksma

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--