perl-unicode

RE: Editing, cursor motion, and combining characters

1999-11-15 10:50:22
Decomposed characters make life hell for search engines. I (and I
suspect most) search engine authors do their best to store them
combined as the canonical form.

Normalisation form C (see UTR 15,
http://www.unicode.org/unicode/reports/tr15/) is fine.  Note
that NFC does not always leave the text free from combining
characters, not even for Latin texts.
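
A minimal sketch of that point, in Python rather than Perl and
using only the standard unicodedata module.  The helper names
to_nfc and has_combining_marks are made up for this illustration;
the two sample strings are my own choice, not taken from UTR 15.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return text in Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def has_combining_marks(text: str) -> bool:
    """True if any code point has a non-zero canonical combining class."""
    return any(unicodedata.combining(ch) != 0 for ch in text)


# "e" + COMBINING ACUTE ACCENT has a precomposed form (U+00E9),
# so NFC folds the pair into a single code point.
composed = to_nfc("e\u0301")

# "m" + COMBINING TILDE has no precomposed form in Unicode,
# so the combining mark is still there after NFC.
uncomposed = to_nfc("m\u0303")

print(len(composed), has_combining_marks(composed))
print(len(uncomposed), has_combining_marks(uncomposed))
```

So code that stores NFC text still has to cope with base
character plus combining mark sequences.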

If you're making search engines, you may also wish to consider
UTR 10 (http://www.unicode.org/unicode/reports/tr10/) and ISO/IEC
FCD 14651 which is very closely related, and any tailorings
of the tables these reports refer to.  Both are still in the making,
especially any tailorings.  Users may wish to be able to do a
level 1, level 2, or level 3 search (there is also a level 4,
which is not required).  In short: level 1 ignores case and
accents, level 2 ignores case but is accent sensitive, and
level 3 is accent and case sensitive.
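
To make the three levels concrete, here is a rough sketch in
Python.  It is NOT the UTR 10 algorithm and uses no collation
tables or tailorings: it only approximates the levels by
stripping combining marks and case folding, which is roughly
adequate for simple Latin text.  The names strip_marks,
search_key, and matches are hypothetical.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose, then drop every combining mark (crude accent removal)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


def search_key(text: str, level: int) -> str:
    """Approximate comparison key; an assumption, not a UTR 10 sort key."""
    if level == 1:
        # Ignore both accents and case.
        return strip_marks(text).casefold()
    if level == 2:
        # Ignore case, keep accents.
        return unicodedata.normalize("NFC", text.casefold())
    if level == 3:
        # Keep accents and case; only unify canonical equivalents.
        return unicodedata.normalize("NFC", text)
    raise ValueError("level must be 1, 2, or 3")


def matches(a: str, b: str, level: int) -> bool:
    """True if a and b are equal at the given approximate level."""
    return search_key(a, level) == search_key(b, level)


print(matches("Resume", "r\u00e9sum\u00e9", 1))
print(matches("Resume", "r\u00e9sum\u00e9", 2))
print(matches("R\u00e9sum\u00e9", "r\u00e9sum\u00e9", 2))
print(matches("R\u00e9sum\u00e9", "r\u00e9sum\u00e9", 3))
```

A real implementation would compare the weights from the
collation element table (with any tailoring) level by level
instead of rewriting the strings like this.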

                Kind regards
                /kent k