"D. J. Bernstein" <djb(_at_)cr(_dot_)yp(_dot_)to> wrote:
> A byte-by-byte regexp matcher that doesn't know anything about UTF-8,
> such as an ancient version of the UNIX grep program, nevertheless does
> a perfect job of matching a UTF-8 regexp against a UTF-8 string.

I think it will never match something that shouldn't match, which is
indeed a pretty cool feature of UTF-8, but it will sometimes fail to
match something that should match. For example, the regexp foo.bar will
fail to match foo±bar, because a byte-oriented "." matches exactly one
byte and ± (U+00B1) is two bytes in UTF-8 (0xC2 0xB1).
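
A quick way to see this, as a rough sketch rather than a claim about any
particular grep: Python's re module treats "." as one byte when given a
bytes pattern and as one character when given a str pattern, so it can
stand in for both kinds of matcher. The variable names here are just
illustrative.

```python
import re

text = "foo±bar"
data = text.encode("utf-8")  # b'foo\xc2\xb1bar': the ± becomes two bytes

# Character-aware matcher: "." consumes the whole ± character.
char_aware = re.fullmatch(r"foo.bar", text) is not None

# Byte-by-byte matcher (assumed stand-in for a UTF-8-unaware grep):
# "." consumes one byte, so the two-byte ± cannot fit.
byte_wise = re.fullmatch(rb"foo.bar", data) is not None

# A literal (no wildcard) still matches byte-wise, since the pattern
# and the string contain the same UTF-8 byte sequence.
literal = re.search("o±b".encode("utf-8"), data) is not None

print(char_aware, byte_wise, literal)
```

Literal patterns come through fine; it is the single-character wildcard
that goes wrong once a character takes more than one byte.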
AMC