Keith Moore writes:
> existing expression matchers seem unlikely to do useful things with utf-8
You couldn't possibly be more wrong.
A byte-by-byte regexp matcher that doesn't know anything about UTF-8,
such as an ancient version of the UNIX grep program, nevertheless does
a perfect job of matching a UTF-8 regexp against a UTF-8 string.
The relevant features of UTF-8 are that (1) it's compatible with ASCII,
so characters such as * are the same in ASCII and UTF-8; and (2) it's
self-synchronizing, so a UTF-8 character cannot match a UTF-8 string
except at a character boundary.
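
A minimal sketch of the two properties above, assuming Python's re module
run over bytes is an acceptable stand-in for a byte-by-byte matcher such as
an old grep. The function names byte_matches and char_boundaries are
hypothetical, chosen for this illustration only. The pattern uses literal
characters plus the ASCII metacharacters ( | ), which encode to the same
bytes in ASCII and UTF-8.

```python
import re


def byte_matches(pattern, text):
    """Match a UTF-8 encoded pattern against UTF-8 encoded text, byte by byte.

    Returns a list of (byte_offset, matched_bytes) pairs. The matcher sees
    only bytes; it is never told that the input is UTF-8.
    """
    pat = pattern.encode("utf-8")
    data = text.encode("utf-8")
    return [(m.start(), m.group(0)) for m in re.finditer(pat, data)]


def char_boundaries(text):
    """Byte offsets at which a UTF-8 character starts.

    Continuation bytes have the form 10xxxxxx; every other byte begins
    a character.
    """
    data = text.encode("utf-8")
    return {i for i, b in enumerate(data) if (b & 0xC0) != 0x80}


if __name__ == "__main__":
    text = "Le café est naïve; cafe naive"
    pattern = "na(ï|i)ve|café"
    boundaries = char_boundaries(text)
    for offset, matched in byte_matches(pattern, text):
        # Every match begins on a character boundary and decodes cleanly.
        print(offset, matched.decode("utf-8"), offset in boundaries)
```

The byte-level matches decode to the same strings a character-aware matcher
would report. Self-synchronization also means the bytes of one character
never appear straddling two others: the encoding of "é" (C3 A9) is not found
inside "Ã©" (C3 83 C2 A9), even though both bytes occur there.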
---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago