[Top] [All Lists]

Re: Complexity of :matches ?

2005-07-11 09:10:24

I just had a user using a wildcard pattern that took a large amount of
CPU time.  Now that this particular problem is solved, I wonder about

Does anybody know the complexity order of recognising wildcard patterns,
which are a subset of regular expressions?

The complexity of a naive implementation of greedy glob-matching is pretty
clearly O(M*N**K), where N is the length of the target string, M is the length
of the pattern, and K is the number of *'s present in the pattern, and it is
pretty easy to construct inputs that will exhibit worse-case behavior - all you
need is lots of *'s in a pattern that never matches.

Using a  non-greedy approach makes it much harder to construct inputs that will
take a long time to process, especially if you also do some simple
optimizations like collapsing strings of repeated *'s. In practice this seems
to avoid many of the more common problems, such as when users specify an
argument to :matches that only makes sense to give to :contains, such as
"******** ATTENTION ********".

However, I do not know if simply using an optimized non-greedy approach
actually reduces the computational complexity to something less than O(M*N**K)
for extremely pathological cases. If it doesn't, my guess is that there's some
sort of dynamic programming trick that can get the complexity down to O(N*M) or
something similar. I haven't tried to find such a trick, mostly because we've
never encountered a pathological case when using optimized non-greedy matching
that would make it relevant.

In any case, the computational complexity of glob-style matches is clearly less
than that of regexp, which is hardly surprising given that regexps require
a NFA. This is why we disable regexp support by default in our implementation.

Perhaps it makes sense to recommend a non-greedy approach in the security
considerations section of the base specification? Allowing implementations to
limit the number of *'s they allog in glob-style patterns might also make
sense, if only to catch cases where users used :matches when they intended to
use :contains.


P.S. I'm curious what pattern you encountered that took a long time. Mind
sharing it?

<Prev in Thread] Current Thread [Next in Thread>