perl-unicode

Re: Unicode decomposition and search engines

1999-11-15 11:32:43
Benjamin Franz wrote on 1999-11-15 17:11 UTC:
Decomposed characters make life hell for search engines. I (and I suspect
most) search engine authors do their best to store them combined as the
canonical form.

Not that much more hell than good search engine authors already have
to go through with Latin-1: case unification, punctuation elimination,
word stemming, handling of alternative letter spellings (e.g. the German
ae=ä and ss=ß, the Danish aa=å, the English o=ô), etc. If you view the
decomposition that a search engine has to do in terms of the levels of
the Unicode/ISO sorting algorithm, then removing the ambiguity between
uppercase and lowercase characters is just the first step of the three
or four decomposition steps that are necessary before Unicode strings
can be compared meaningfully.
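
As a rough illustration of the steps listed above, here is a minimal Python sketch of a search-key function. The name search_key and the ALT_SPELLINGS table are hypothetical, made up for this example; the table is language-dependent and far from complete, and word stemming is left out entirely. It only shows how case unification, alternative spellings, accent removal via decomposition, and punctuation elimination might be chained.

```python
import unicodedata

# Hypothetical, language-dependent table of alternative spellings.
# A real engine would pick such a table per language.
ALT_SPELLINGS = {"ä": "ae", "ö": "oe", "ü": "ue", "å": "aa"}


def search_key(text: str) -> str:
    """Reduce a string to a crude comparison key for indexing."""
    # Compose first, so precomposed and decomposed input look the same.
    composed = unicodedata.normalize("NFC", text)
    # Case unification; casefold() also maps "ß" to "ss".
    folded = composed.casefold()
    # Alternative letter spellings (ä -> ae, å -> aa, ...).
    expanded = "".join(ALT_SPELLINGS.get(ch, ch) for ch in folded)
    # Decompose and drop the remaining combining marks (ô -> o).
    decomposed = unicodedata.normalize("NFD", expanded)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    # Punctuation elimination: keep letters and digits, collapse the rest.
    kept = "".join(ch if ch.isalnum() else " " for ch in stripped)
    return " ".join(kept.split())
```

With this sketch, "Märchen" typed with a precomposed ä and "Ma" + U+0308 + "rchen" typed with a combining diaeresis give the same key, which is the point: the combined/decomposed distinction is only one of several ambiguities that have to be removed before comparison.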

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>