possible regexp feature for 5.6: "ignore diacritics"


Sarathy said that now that _62 is out of the bag he is in the process
of switching his brain into beta mode, meaning that any suggested new
features are likely to be met with little enthusiasm.  Therefore I'm
bringing this idea to the table now.  The idea has been mentioned
before in passing, but never before explicitly discussed on its own.

In POSIX (1003.2) regexps there is a feature called "equivalence
classes".  What this means is that certain characters belong to such
classes, and _any_member_of_a_class_stands_for_any_member_of_the_class_.

This concept is handy when matching diacritic-laden text in
non-ASCII encodings.  For example, finding "bär" when matching with
"bar" would often be most convenient.  The concept is not limited to
Western alphabets; it also works for Cyrillic/Greek/Hebrew/Arabic/...
alphabets.

If you don't have a copy of POSIX 1003.2 around (sadly enough, I don't, so I
might be speaking in slightly wrong terms), you can take a look at your
tr(1).  If you have a reasonably modern system, tr(1) knows how to do
equivalence classes, based on your locale.

The notation is [=c=], where c is a character.  (The context in which
equivalence classes have previously been mentioned is the POSIX regexp
extensions in general, such as the recently implemented [:class:]
extension.)

Now, the problem, or rather, problems, are as follows.  Firstly, there
is no one central definition of which characters belong to which
equivalence classes.  Secondly, there has been no C API defined for
accessing the classes or matching using those classes. I guess a POSIX
1003.2-compliant regex(3) just has to implement them, somehow.

Now, however, for the first problem, we have Unicode.  From its
database one can find the "decomposition" (base character +
diacritics) for most diacritic characters, and for a few more
characters one can find the base even when the diacritics are a bit
"non-standard".  Therefore it would be possible to do the equivalence
class mapping.  Note: the first definition, using decomposition, is
all safe and sound; it's defined by the Unicode Consortium.  The second
definition is my addition, I'm afraid... I implemented it when
I noticed that several clearly "diacritic" characters did not have
a decomposition listed, but one could still deduce one from the _name_
of the character.
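
To make the two definitions concrete, here is a rough sketch of how a
base character could be looked up.  It assumes a modern perl where
Unicode::Normalize and charnames ship with the core (neither is what
the _62 table generation uses), and base_char() is a made-up name.  The
"<BASE NAME> WITH <SOMETHING>" rule is only my reading of the
name-based heuristic, not a Unicode-sanctioned mapping.

```perl
use strict;
use warnings;
use charnames ();
use Unicode::Normalize qw(NFD);

# Hypothetical helper: return the base character of $char,
# or $char itself if no base can be found.
sub base_char {
    my ($char) = @_;

    # First definition: canonical decomposition (base + combining marks).
    my $decomposed = NFD($char);
    if (length($decomposed) > 1) {
        (my $stripped = $decomposed) =~ s/\p{Mn}+//g;
        return $stripped if length($stripped) == 1;
    }

    # Second definition (assumed heuristic): the character name looks
    # like "<BASE NAME> WITH <SOMETHING>", so look up <BASE NAME>.
    my $name = charnames::viacode(ord $char);
    if (defined $name && $name =~ /^(.+?) WITH /) {
        my $code = charnames::vianame($1);
        return chr $code if defined $code;
    }

    return $char;
}

# a-diaeresis has a decomposition; o-with-stroke does not, and has to
# go through the name; plain "b" is returned unchanged.
for my $char ("\x{E4}", "\x{F8}", "b") {
    printf "U+%04X -> U+%04X\n", ord $char, ord base_char($char);
}
```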

I like using Unicode here very much because it rids us of the
nonexistent/missing/broken/conflicting locale definitions by
operating system vendors.

In preparation for understanding equivalence classes, I added for _62
the computation of two tables, lib/unicode/Eq/Latin1 and
lib/unicode/Eq/Unicode.  They contain lines of "base variant variant ...".
You can try out

perl -pe 's/([\da-f]+)/chr(hex($1))/ieg' lib/unicode/Eq/Latin1

to see the "equivalence classes".  You can do the same for Eq/Unicode
if you have a UTF8-capable terminal.

With these tables, equivalence classes for regular expressions could
be implemented.
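
As a minimal sketch of what that could look like at the Perl level
(the real thing would live inside the regexp engine), the lines of
such a table can be turned into character classes.  The sample lines
below are illustrative only, written in the "base variant variant ..."
format with hex code points; they are not copied from Eq/Latin1, and
load_classes() and eq_pattern() are made-up names.

```perl
use strict;
use warnings;

# Illustrative table lines (hex code points), NOT the real Eq/Latin1.
my @table_lines = (
    "0061 00E0 00E1 00E2 00E3 00E4 00E5",   # a and some variants
    "006F 00F2 00F3 00F4 00F5 00F6 00F8",   # o and some variants
);

# Map every member of a class to the list of all members of that class.
sub load_classes {
    my %class_of;
    for my $line (@_) {
        my @members = map { chr(hex($_)) } split ' ', $line;
        $class_of{$_} = \@members for @members;
    }
    return \%class_of;
}

# Turn a literal string into a pattern in which each character
# matches any member of its equivalence class.
sub eq_pattern {
    my ($classes, $literal) = @_;
    my $pattern = join '', map {
        my $members = $classes->{$_};
        $members ? '[' . join('', @$members) . ']' : quotemeta($_);
    } split //, $literal;
    return qr/$pattern/;
}

my $classes = load_classes(@table_lines);
my $re      = eq_pattern($classes, "bar");
print "matched\n" if "b\x{E4}r" =~ /^$re\z/;
```

Because every member maps to the whole class, a pattern built from
"b\x{E4}r" would match plain "bar" as well.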

Note, however, that we would probably want to allow people to define
their own equivalence classes (somebody may want to add, change, or
remove some class definitions if they don't match their particular
tastes, conventions, or languages).  This, of course, opens up the
taint gates...

In addition to the POSIX 1003.2 notation, [=c=], I think we could allow
for a new regexp flag to turn on "ignore diacritics", just like we
have "ignore case".  Maybe /d?  (/e would have been nice but s///e
preëmpted us.)
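
The analogy with "ignore case" can be sketched in plain Perl: just as
one can lowercase both sides before comparing, one can fold the
diacritics away on both sides.  This is only an emulation for literal
strings, not the proposed flag; it assumes a modern perl with
Unicode::Normalize in the core, it covers only the decomposition-based
definition, and both function names are made up.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then drop the combining marks.
sub fold_diacritics {
    my ($text) = @_;
    (my $folded = NFD($text)) =~ s/\p{Mn}+//g;
    return $folded;
}

# Emulate "ignore diacritics" for a literal needle by folding both sides.
sub contains_ignoring_diacritics {
    my ($haystack, $needle) = @_;
    return index(fold_diacritics($haystack), fold_diacritics($needle)) >= 0;
}

print "found\n" if contains_ignoring_diacritics("ein b\x{E4}r", "bar");
```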

I am not emotionally *that* deeply attached to the feature, mostly
because I'm really low on tuits, and will be for some time.  But I
know it's a useful concept, and have a fair idea of how it could be
done, and I wanted to put the idea on the table.

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen