This isn't a trivial task, so you may or may not get someone to give
you a working solution for free.
One way to tackle this is to:
- tokenize the search string into individual words
- mark up those individual words in the document
- identify sequences of that markup
- replace the sequences with the replacement markup
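
As a rough illustration of those four steps, here is a minimal sketch in Python rather than XSLT. It works on plain text only and deliberately ignores element boundaries, which is the hard part of the real problem. The function name mark_phrase and the <hit> wrapper are assumptions made up for this example, not anything from the original question.

```python
import re

def mark_phrase(text, phrase, open_tag="<hit>", close_tag="</hit>"):
    """Wrap every occurrence of `phrase` (matched word by word) in markup."""
    # Step 1: tokenize the search string into individual words.
    wanted = phrase.lower().split()
    n = len(wanted)
    # Step 2: "mark up" the words in the document by recording each word
    # together with its start and end offsets.
    words = [(m.group().lower(), m.start(), m.end())
             for m in re.finditer(r"\w+", text)]
    out, pos, i = [], 0, 0
    while i < len(words):
        # Step 3: identify a run of consecutive words equal to the phrase.
        if n and [w for w, _, _ in words[i:i + n]] == wanted:
            start, end = words[i][1], words[i + n - 1][2]
            # Step 4: replace the run with the replacement markup.
            out.append(text[pos:start])
            out.append(open_tag + text[start:end] + close_tag)
            pos = end
            i += n
        else:
            i += 1
    out.append(text[pos:])
    return "".join(out)
```

Because matching is done on word tokens rather than on the raw string, differences in whitespace and letter case between the search string and the document do not prevent a match.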
Yes, it's definitely challenging. Reading the problem and Andrew's
solution makes me realise that this is an example of the class of
problems which Michael Jackson (of Jackson Structured Programming fame)
calls "boundary clash" problems. In the markup field these tend to be
described as "overlap" problems. You have two hierarchies in the
document - the element hierarchy and the sentence/word/character
hierarchy, and they overlap in the sense that the boundaries in one
hierarchy don't coincide with those in the other. The technique, at a
very high level of abstraction, is to rearrange the document into the
hierarchy that you want to process, while retaining sufficient
information to reconstitute the other hierarchy when you are done. This
retained information can either be inline (perhaps in the form of
"milestone" tags), or out-of-line (an index of pointers into the text).
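
To make that concrete, here is a small sketch of the out-of-line variant, again in Python rather than XSLT. It flattens the markup into plain text plus an index of (offset, tag) pointers, searches the text, and then reconstitutes the element hierarchy, recording each hit with empty milestone tags so the result stays well-formed even when a hit overlaps an element. The names flatten, rebuild, mark_phrase_in_markup, hit-start and hit-end are all assumptions for illustration, and the regex-based tag splitting is a simplification that does not handle comments, CDATA sections or entity references.

```python
import re

TAG = re.compile(r"<[^>]+>")  # simplification: no comments, CDATA or PIs

def flatten(xml):
    """Return the plain text plus an out-of-line index of (offset, rank, tag)."""
    text, index, pos, length = [], [], 0, 0
    for m in TAG.finditer(xml):
        chunk = xml[pos:m.start()]
        text.append(chunk)
        length += len(chunk)
        index.append((length, 1, m.group()))  # rank 1 = original tag
        pos = m.end()
    text.append(xml[pos:])
    return "".join(text), index

def rebuild(text, index):
    """Reinsert every indexed tag at its recorded text offset."""
    out, pos = [], 0
    # At equal offsets: end milestones (0), original tags (1), start milestones (2).
    # The sort is stable, so original tags keep their relative order.
    for offset, _rank, tag in sorted(index, key=lambda e: (e[0], e[1])):
        out.append(text[pos:offset])
        out.append(tag)
        pos = offset
    out.append(text[pos:])
    return "".join(out)

def mark_phrase_in_markup(xml, phrase):
    """Mark each occurrence of `phrase` in the text with milestone tags."""
    if not phrase:
        return xml
    text, index = flatten(xml)
    for m in re.finditer(re.escape(phrase), text):
        index.append((m.start(), 2, "<hit-start/>"))
        index.append((m.end(), 0, "<hit-end/>"))
    return rebuild(text, index)
```

For example, searching for "two three" in `<p>one <b>two</b> three</p>` yields `<p>one <b><hit-start/>two</b> three<hit-end/></p>`: the hit straddles the end of the b element, so it cannot be a wrapping element and is recorded as a pair of milestones instead.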
Michael Kay
Saxonica