I just had a user using a wildcard pattern that took a large amount of
CPU time. Now that this particular problem is solved, I wonder about
":matches".
Does anybody know the complexity order of recognising wildcard patterns,
which are a subset of regular expressions?
The complexity of a naive implementation of greedy glob-matching is pretty
clearly O(M*N**K), where N is the length of the target string, M is the length
of the pattern, and K is the number of *'s present in the pattern, and it is
pretty easy to construct inputs that will exhibit worse-case behavior - all you
need is lots of *'s in a pattern that never matches.
Using a non-greedy approach makes it much harder to construct inputs that will
take a long time to process, especially if you also do some simple
optimizations like collapsing strings of repeated *'s. In practice this seems
to avoid many of the more common problems, such as when users specify an
argument to :matches that only makes sense to give to :contains, such as
"******** ATTENTION ********".
However, I do not know if simply using an optimized non-greedy approach
actually reduces the computational complexity to something less than O(M*N**K)
for extremely pathological cases. If it doesn't, my guess is that there's some
sort of dynamic programming trick that can get the complexity down to O(N*M) or
something similar. I haven't tried to find such a trick, mostly because we've
never encountered a pathological case when using optimized non-greedy matching
that would make it relevant.
In any case, the computational complexity of glob-style matches is clearly less
than that of regexp, which is hardly surprising given that regexps require
a NFA. This is why we disable regexp support by default in our implementation.
Perhaps it makes sense to recommend a non-greedy approach in the security
considerations section of the base specification? Allowing implementations to
limit the number of *'s they allog in glob-style patterns might also make
sense, if only to catch cases where users used :matches when they intended to
use :contains.
Ned
P.S. I'm curious what pattern you encountered that took a long time. Mind
sharing it?