Re: [xsl] Matching string values across element boundaries

David - I think the answer to whether there is any improvement to be made to
your system will depend in detail on just how the matching algorithm works.
Clearly if it expects a string, you have to give it one, and you are left with
something like your approach. If you're willing to revisit the matching
algorithm (I expect you don't want to - it sounds hairy), you could probably
also change the markup generation.

One idea that springs to mind is the highlighters available in search
platforms: these typically operate on text only, remembering the position of
every word, and allow you to mark the matched words with tags in a
highlighting pass. You can later coalesce those tags using XSLT or some other
markup-aware process. If you can cast the matching problem as a search
problem, you could leverage MarkLogic, or Lucene, or something like that.
Maybe that would be better than what you have; I don't know.


-Mike

On 4/8/2013 2:15 PM, David Sewell wrote:

I expect this has been discussed here before, but I can't locate any relevant
discussion, so here goes.

We have input data with many unmarked short-title citations that look like this:

    Sprague, <hi rend="italic">Braintree Families</hi>

We want to wrap them inside another element, in our case a <ref> to the
bibliographic expansion. We have a venerable chain of XSLT 2.0 transforms that
does this, and pretty well, by preprocessing the data to convert all those <hi>
tags into a pair of unique ASCII characters, so that we can do string-matching
operations within a single text node that now includes something like

    Sprague, ¢Braintree Families¥

which is easy to handle with xsl:analyze-string. Then, once we've wrapped all
the strings we need to, we post-process with xsl:analyze-string again to put
the <hi> elements back in.
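
For readers following along, the three passes described above might be
sketched roughly as follows. This is only an illustration under stated
assumptions, not the actual transform chain: the elements are assumed to be in
no namespace, the mode names (flatten, wrap, restore) are invented here, the
<ref> is emitted without any pointer to the bibliography, and the single
literal regex in the second pass stands in for whatever citation patterns a
real project would need.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Run the three passes in sequence, each over the previous result. -->
  <xsl:template match="/">
    <xsl:variable name="flat">
      <xsl:apply-templates mode="flatten"/>
    </xsl:variable>
    <xsl:variable name="wrapped">
      <xsl:apply-templates select="$flat/node()" mode="wrap"/>
    </xsl:variable>
    <xsl:apply-templates select="$wrapped/node()" mode="restore"/>
  </xsl:template>

  <!-- Identity copy shared by all three (hypothetical) modes. -->
  <xsl:template match="@*|node()" mode="flatten wrap restore">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- Pass 1: replace <hi rend="italic"> with placeholder characters.
       Adjacent text nodes are merged when the temporary tree is built. -->
  <xsl:template match="hi[@rend='italic']" mode="flatten">
    <xsl:text>¢</xsl:text>
    <xsl:apply-templates mode="flatten"/>
    <xsl:text>¥</xsl:text>
  </xsl:template>

  <!-- Pass 2: wrap a short title in <ref>. The regex is an illustrative
       placeholder, not a general citation pattern. -->
  <xsl:template match="text()" mode="wrap" priority="1">
    <xsl:analyze-string select="." regex="Sprague, ¢Braintree Families¥">
      <xsl:matching-substring>
        <ref><xsl:value-of select="."/></ref>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Pass 3: turn the placeholder pairs back into <hi> elements. -->
  <xsl:template match="text()" mode="restore" priority="1">
    <xsl:analyze-string select="." regex="¢([^¥]*)¥">
      <xsl:matching-substring>
        <hi rend="italic"><xsl:value-of select="regex-group(1)"/></hi>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Given an input such as

    <p>See Sprague, <hi rend="italic">Braintree Families</hi>, 12.</p>

this sketch should produce

    <p>See <ref>Sprague, <hi rend="italic">Braintree Families</hi></ref>, 12.</p>

The explicit priority on the text() templates avoids a conflict with the
identity template, which has the same default priority.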

In practice, given the proper regexes, this works quite well and provides the
desired output, but I always feel a bit guilty about the hackishness of the
approach. Given that the citations are quite variable in structure (usually but
not always containing <hi> elements, with various combinations of text nodes at
start and end), I've never come up with a good general-purpose way to operate
purely on elements and text nodes without the convert-tags-to-characters step.
Is there one (or more)?

David S.


