perl-unicode

Re: Editing, cursor motion, and combining characters

1999-11-15 11:05:23

    Benjamin> On Mon, 15 Nov 1999, Mark Leisher wrote:
    >> To support these behaviors, the editing system should store the text in
    >> decomposed form.  Though it complicates some aspects of the editor
    >> design, it is useful for other reasons.

    Benjamin> Decomposed characters make life hell for search engines. I (and
    Benjamin> I suspect most) search engine authors do their best to store
    Benjamin> them combined as the canonical form.

One of our big areas is multi-lingual IR, and our experience disagrees with
your conclusion.  We have found that while storing decomposed text complicates
the code a little, it makes searching a lot easier (I will send out the URL to
an IUC14 paper we did in this area).  Among other things, we have to decompose
to construct correct regular expressions, and for various activities we need
the decomposed form to selectively include or exclude non-spacing marks, for
example to keep the size of indexes smaller.  As stated in the last message,
the level of composition affects rendering and editing behaviors as well.
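
To make the point about non-spacing marks concrete, here is a minimal sketch
in Python using the standard unicodedata module.  It is only an illustration,
not the code from our system: the function names are hypothetical, and the
regular expression assumes the marks of interest all fall in the Combining
Diacritical Marks block (U+0300 to U+036F), which a real implementation would
need to widen.

```python
import re
import unicodedata

# Assumption: only marks in U+0300..U+036F are considered here.
OPTIONAL_MARKS = "[\u0300-\u036f]*"

def decompose(text):
    """Return text in canonical decomposed form (NFD)."""
    return unicodedata.normalize("NFD", text)

def strip_marks(text):
    """Decompose, then drop every non-spacing mark (general category Mn)."""
    return "".join(
        ch for ch in decompose(text) if unicodedata.category(ch) != "Mn"
    )

def index_key(text, keep_marks=False):
    """Build an index key, optionally excluding non-spacing marks."""
    return decompose(text) if keep_marks else strip_marks(text)

def mark_insensitive_regex(query):
    """Compile a pattern that matches the query's base characters in
    decomposed text, allowing any run of marks after each one."""
    base = strip_marks(query)
    return re.compile("".join(re.escape(ch) + OPTIONAL_MARKS for ch in base))
```

With the stored text decomposed, a query such as "resume" can match
"r\u00e9sum\u00e9" through mark_insensitive_regex, and index_key maps both
spellings to the same smaller key.  Doing the same against composed text would
require enumerating every precomposed variant of each letter.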

Note that it is not unreasonable to keep things in composed form; it just
increases the costs a little bit in other areas of the implementation.
-----------------------------------------------------------------------------
Mark Leisher
Computing Research Lab            I have never made but one prayer to God,
New Mexico State University       a very short one:
Box 30001, Dept. 3CRL                 "Oh Lord, make my enemies ridiculous."
Las Cruces, NM  88003             And God granted it.  -- Voltaire, letter