xsl-list
[Top] [All Lists]

Re: [xsl] why matches($title,'.*?(\.|,)\s*$')) can perform so much worse than matches($title,'(\.|,)\s*$'))

2011-07-13 18:13:08
O
Another interesting article is this one describing some of the
optimizations performed by the regex engine in Google Chrome:
http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html

This mentions another trick used by some regex implementations.  In
their example "Sun|Mon", their engine recognises that a match for this
expression always contains "n" in the third character, and so rather
than testing for a match at each index in the string (which was the
problem with the example given) they first scan the string to find "n"
characters and only try to apply the regex starting two characters
preceding one.

This is similar to the Boyer-Moore algorithm that Perl (and probably
pcre) also uses - originally it was just for constant strings. There was
a fun race to get the fastest "grep" in the early 1990s.

The difficulty with a lot of these optimizations is that capturing
groups can make them harder - and in Perl people use capturing groups
all the time of course. You can't turn (font|foot) into fo(nt|ot), and
although you can turn it into (fo(?:nt|ot)) you may then lose the
benefit.

Since the article Mike Key mentioned was written in 2007, Perl's regexps
seem to have got a lot faster. But even in the early 1990s I remember
converting sed scripts into perl and getting huge speedup, even though
Perl was using recursive back-tracking. The BM-style delta tables
probably accounted for most of that.

Liam


-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--