The complexity of a naive implementation of greedy glob-matching is pretty
clearly O(M*N**K), where N is the length of the target string, M is the length
of the pattern, and K is the number of *'s present in the pattern, and it is
pretty easy to construct inputs that will exhibit worse-case behavior - all
you
need is lots of *'s in a pattern that never matches.
Right, and that's what happened. It wasn't a problem until recently, when
users started to use multiple stars instead of one, like
********(_at_)domain(_dot_)
Don't ask me why.
However, I do not know if simply using an optimized non-greedy approach
actually reduces the computational complexity to something less than O(M*N**K)
for extremely pathological cases. If it doesn't, my guess is that there's some
sort of dynamic programming trick that can get the complexity down to O(N*M)
or
something similar. I haven't tried to find such a trick, mostly because we've
never encountered a pathological case when using optimized non-greedy matching
that would make it relevant.
So you suspect that there might be an algorithm that does not perform
backtracking? I can not imagine O(N*M) as worst case unless we can kick
backtracking out, although it might well be an average case. But that's
just a feeling and it may be wrong.
In any case, the computational complexity of glob-style matches is clearly
less
than that of regexp, which is hardly surprising given that regexps require
a NFA. This is why we disable regexp support by default in our implementation.
That's what I wonder about. The NFA can be converted to a DFA,
backtracking being an alternative, but that same appears to be true for
wildcard patterns to me, if multiple stars are allowed.
Perhaps it makes sense to recommend a non-greedy approach in the security
considerations section of the base specification? Allowing implementations to
limit the number of *'s they allog in glob-style patterns might also make
sense, if only to catch cases where users used :matches when they intended to
use :contains.
If we can not come up with a proof for worst-case complexity of non-greedy
algorithms, then I think we should document that pattern matching may
require time exponential to the number of stars in the pattern and indeed
allow implementations to put a limit on pattern matching, generating an
error if the limit is exceeded. The implementation should be free to express
the limit in the number of stars, the number of performed backtracking
operations, used CPU time or whatever else.
Interestingly, the PCRE library already contains functions to do that
for regular expressions, which is why the Exim filter language (quite
similar to Sieve) can offer regular expressions at no risk.
Michael