Re: [xsl] marking up text when term from other file is found

2010-04-22 01:21:54
I would try to solve this as follows:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output method="xml" indent="yes" />

  <xsl:variable name="index-terms" select="document('indexTerms.xml')" />

  <!-- Identity transform: copy everything that is not handled below -->
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" />
    </xsl:copy>
  </xsl:template>

  <!-- Wrap every occurrence of an index term in a <ph> element.
       The match is case-sensitive as written. -->
  <xsl:template match="text()" priority="10">
    <xsl:analyze-string select="."
        regex="{string-join(for $term in $index-terms/terms/term
                            return concat('(', $term, ')'), '|')}">
      <xsl:matching-substring>
        <!-- Build the id from the matching term's index* attributes -->
        <xsl:variable name="idVal"
            select="string-join($index-terms/terms/term[. = regex-group(0)]
                                /@*[starts-with(name(), 'index')], '_')" />
        <ph id="{$idVal}">
          <xsl:value-of select="." />
        </ph>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="." />
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>

You may adapt this to suit your requirements.
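
The stylesheet above matches case-sensitively, and shorter terms listed first (e.g. "Children") can win over longer ones ("Children, illness"). Below is a minimal sketch of one possible refinement, not a tested solution. It assumes the same indexTerms.xml layout, at least one term in the file, and that text already inside a <ph> should be left untouched. The names term-regex, alternatives, hit, term and id-parts are my own. It does not handle inflected forms (e.g. "Curing fever" will not match "Cure fever") or word boundaries.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

  <xsl:output method="xml" indent="no" />

  <xsl:variable name="index-terms" select="document('indexTerms.xml')" />

  <!-- All terms as one alternation: longest first, regex metacharacters
       escaped, and each space in a term allowed to match any run of
       whitespace (including line breaks) in the text. -->
  <xsl:variable name="term-regex" as="xs:string">
    <xsl:variable name="alternatives" as="xs:string*">
      <xsl:for-each select="$index-terms/terms/term">
        <xsl:sort select="string-length(normalize-space(.))"
                  order="descending" />
        <xsl:sequence select="replace(
            replace(normalize-space(.),
                    '[\\\.\[\]\(\)\{\}\|\*\+\?\^\$\-]', '\\$0'),
            ' ', '\\s+')" />
      </xsl:for-each>
    </xsl:variable>
    <xsl:sequence select="string-join($alternatives, '|')" />
  </xsl:variable>

  <!-- Identity transform -->
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" />
    </xsl:copy>
  </xsl:template>

  <!-- Text already inside a <ph> is copied as is, to avoid nested markup -->
  <xsl:template match="ph//text()" priority="20">
    <xsl:copy />
  </xsl:template>

  <xsl:template match="text()" priority="10">
    <!-- flags="i" makes the regex match regardless of case -->
    <xsl:analyze-string select="." regex="{$term-regex}" flags="i">
      <xsl:matching-substring>
        <!-- Look the term up again, ignoring case and whitespace runs -->
        <xsl:variable name="hit" select="lower-case(normalize-space(.))" />
        <xsl:variable name="term"
            select="($index-terms/terms/term
                       [lower-case(normalize-space(.)) = $hit])[1]" />
        <!-- Attribute order is implementation-dependent, so sort by name -->
        <xsl:variable name="id-parts" as="xs:string*">
          <xsl:for-each select="$term/@*[starts-with(name(), 'index')]">
            <xsl:sort select="name()" />
            <xsl:sequence select="string(.)" />
          </xsl:for-each>
        </xsl:variable>
        <ph id="{string-join($id-parts, '_')}">
          <xsl:value-of select="." />
        </ph>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="." />
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

I set indent="no" here because indentation can add whitespace inside mixed content such as <p>; switch it back if that does not matter for your output.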

On Thu, Apr 22, 2010 at 8:38 AM, Hoskins & Gretton
<hoskgret(_at_)rochester(_dot_)rr(_dot_)com> wrote:

Hi, I need help finding resources (examples and/or XSL) for this situation,
for which I haven't found quite the right recipe in my searches of the list
archives.
Given an XML file containing a list of terms and another file containing a
mix of elements containing text (narrative content, some inline markup for
emphasis and footnotes), I was asked if I could find occurrences of each
term wherever it appeared in the narrative content, and wrap each occurrence
with a tag. So my first thought is to load up each document into a variable.
But then I don't know what the most effective method of string comparison
would be, given that the narrative document might have the term's words with
different capitalization. If anyone can point me in the right direction, I'd
appreciate it. Also I would like to know if there is a practical limit to
how large a narrative file I can run with about 150 terms to find in the
text. And if a different approach would work better, such as writing Java
to do brute-force search and replace, please tell me so. (I work with a
Java programmer. Everything looks like a Java problem to her and an XSL
problem to me.)
-- Dorothy
Note: Using Saxon B 9.1.0.7. I just made up a set of terms and a bad
sentence as an example.
Example of terms (indexTerms.xml):
<?xml version="1.0" encoding="UTF-8"?>
<terms>
  <term index1="anxiety">Anxiety</term>
  <term index1="children">Children</term>
  <term index1="children" index2="illness">Children, illness</term>
  <term index1="children" index2="nightmare">Children, nightmare</term>
  <term index1="cure" index2="fever">Cure fever</term>
  <term index1="cure" index2="illness">Cure illness</term>
  <term index1="anxiety" index2="nightmare">Nightmare</term>
  <term index1="children" index2="illness">Sick children</term>
  <term index1="anxiety" index2="phobia">Worries, phobias and anxiety</term>
</terms>

Example of narrative (sampleTopic.xml):
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
"http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd">
<topic id="sampleTopic">
 <title>sampleTopic</title>
 <body>
   <p>markup for sample terms testing a set of phrases to match to the
content of index terms:</p>
   <p>Texttexttext text some of the terms are already in &lt;ph&gt; i.e. <ph
id="cure_fever">curing fever</ph>, <ph id="children_illness">sick
children</ph> and sometime the same terms occur, <i>but different case</i>,
not in a ph: Curing fever and <b>Sick children</b>. I need to get all the
occurrences of each of the term element strings marked up with &lt;ph&gt;
</p>
 </body>
</topic>

Desired result:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
"http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd">
<topic id="sampleTopic">
 <title>sampleTopic</title>
 <body>
   <p>markup for sample terms testing a set of phrases to match to the
content of index terms:</p>
   <p>Texttexttext text some of the terms are already in &lt;ph&gt; i.e. <ph
id="cure_fever">curing fever</ph>, <ph id="children_illness">sick
children</ph> and sometime the same terms occur, <i>but different case</i>,
not in a ph: <ph id="cure_fever">Curing fever</ph> and <b><ph
id="children_illness">Sick children</ph></b>. I need to get all the
occurrences of each of the term element strings marked up with &lt;ph&gt;
</p>
 </body>
</topic>

XSL:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:param name="indexFile">indexTerms.xml</xsl:param>
<xsl:param name="textFile">sampleTopic.xml</xsl:param>
<xsl:variable name="termsDocument"
select="document($indexFile)"></xsl:variable>
<xsl:variable name="textDocument"
select="document($textFile)"></xsl:variable>
<xsl:template match="*" name="test1"><xsl:result-document
href="matchText-test.xml" method="xml">
<!-- proof that I can get the terms -->
<xsl:text>&#10;</xsl:text><xsl:comment><xsl:text>first term is
</xsl:text><xsl:value-of
select="$termsDocument/terms/term[1]"/></xsl:comment>
<xsl:text>&#10;</xsl:text><xsl:comment><xsl:text>second term is
</xsl:text><xsl:value-of
select="$termsDocument/terms/term[2]"/></xsl:comment>
<xsl:text>&#10;</xsl:text><xsl:comment><xsl:text>third term is
</xsl:text><xsl:value-of
select="$termsDocument/terms/term[3]"/></xsl:comment>
<!-- now how do I find them in the $textDocument elements and mark them up?
-->
</xsl:result-document>
</xsl:template>
</xsl:stylesheet>



-- 
Regards,
Mukul Gandhi
