perl-unicode

Re: possible regexp feature for 5.6: "ignore diacritics"

1999-10-18 02:47:08

    Jarkko> This concept is handy when matching for diacritic-laden variants
    Jarkko> of non-ASCII encodings.  For example finding "bär" when matching
    Jarkko> with "bar" would often be most convenient.  The concept is not
    Jarkko> limited to Western alphabets; it also works on
    Jarkko> Cyrillic/Greek/Hebrew/Arabic/...  alphabets.

We use a specific variation of this feature extensively in our own Unicode
regexp support for multilingual concordancing: users can choose to ignore
non-spacing characters.  This works for us because we generally work with
Unicode text in decomposed form.
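
As a rough illustration of the idea (sketched in Python with its standard
unicodedata module, not taken from our own regexp code), text can be
decomposed and its non-spacing marks dropped before matching.  The helper
names below are hypothetical.

```python
import re
import unicodedata

def strip_nonspacing(text):
    """Decompose text (NFD) and drop non-spacing marks (general category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def search_ignoring_marks(pattern, text):
    """Search for pattern in text after removing non-spacing marks.

    Match offsets refer to the stripped text, not to the original string.
    """
    return re.search(pattern, strip_nonspacing(text))
```

With this sketch, searching for "bar" finds a hit in "bär", because the
decomposed form is "a" followed by a combining diaeresis that is then ignored.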

Perl does not have the luxury of assuming that text is always properly
decomposed, so equivalence classes could instead be constructed from the
Unicode Character Database, as Jarkko points out.
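
One possible way to derive such classes, sketched in Python on the assumption
that its unicodedata module is an acceptable stand-in for the Unicode
Character Database: group each precomposed character under the base letter of
its canonical decomposition, then expand every letter of a search word into a
bracket class.  The function names and the code point limit are illustrative
only.

```python
import re
import unicodedata
from collections import defaultdict

def build_equivalence_classes(limit=0x250):
    """Map each base character to the precomposed characters built on it.

    Only code points below `limit` are scanned (an arbitrary choice that
    covers Latin-1 and Latin Extended-A/B).
    """
    classes = defaultdict(set)
    for cp in range(limit):
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFD", ch)
        marks = decomposed[1:]
        # Keep only "base + non-spacing marks" decompositions.
        if marks and all(unicodedata.category(m) == "Mn" for m in marks):
            classes[decomposed[0]].add(ch)
    return classes

def pattern_ignoring_diacritics(word, classes):
    """Compile a regexp in which each letter matches its whole class."""
    parts = []
    for ch in word:
        members = sorted(classes.get(ch, set()) | {ch})
        parts.append("[" + "".join(re.escape(m) for m in members) + "]")
    return re.compile("".join(parts))
```

A pattern built this way from "bar" matches precomposed "bär".  It does not
handle text that is already decomposed; that case would need the combining
marks skipped or stripped as well.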

    Jarkko> Note, however, that people should be allowed to define their own
    Jarkko> equivalence classes (somebody may want to add, change, or remove
    Jarkko> some class definitions if they don't match their particular
    Jarkko> tastes, conventions, or languages).  This, of course, opens up
    Jarkko> the taint gates...

User-level equivalence classes should perhaps be restricted to the familiar
"[xxxxxxx]" bracket notation.  This keeps the messy details out of the syntax.

    Jarkko> In addition to the POSIX 1003.2 notation, [=c=], I think we could
    Jarkko> allow for a new regexp flag to turn on "ignore diacritics", just
    Jarkko> like we have "ignore case".  Maybe /d?  (/e would have been
    Jarkko> nice but s///e preëmpted us.)

This would indeed be nice to have.
-----------------------------------------------------------------------------
Mark Leisher
Computing Research Lab            The first virtue is to restrain the tongue;
New Mexico State University       he approaches nearest to the gods who knows
Box 30001, Dept. 3CRL             how to be silent, even though he is in the
Las Cruces, NM  88003             right.    -- Cato the Younger (95-46 B.C.E)