xsl-list

Re: [xsl] Splitting a paragraph into sentences and keep markup

2019-11-24 11:15:23
There’s a package for splitting at arbitrarily deeply nested nodes. It is part of a paper that I presented at XML Prague this year: https://archive.xmlprague.cz/2019/files/xmlprague-2019-proceedings.pdf#page=347

The package itself is at https://subversion.le-tex.de/common/presentations/2019-02-09_xmlprague_xslt-upward-projection/lib/split.xsl

Using this package, Martin's p-matching template becomes:

<xsl:template match="p[node()]">
  <xsl:variable name="p-with-markers" as="element(p)">
    <xsl:apply-templates select="." mode="insert-marker"/>
  </xsl:variable><!-- this hasn't changed -->
  <xsl:variable name="chunks" as="document-node(element(split:chunks))">
    <xsl:apply-templates select="$p-with-markers"
      mode="split:split-entrypoint"><!-- mode provided by lib/split.xsl -->
      <!-- group-start-exp will be evaluated as an XPath expression for each
           node in a for-each-group[@group-starting-with] population. If a
           population node satisfies the expression, it will start a group. -->
      <xsl:with-param name="group-start-exp" as="xs:string"
        select="'self::eos'"/>
      <!-- remove <eos/> after splitting -->
      <xsl:with-param name="keep-splitting-node" as="xs:boolean"
        select="false()"/>
    </xsl:apply-templates>
  </xsl:variable>
  <xsl:copy-of select="$chunks/split:chunks/split:chunk/p[node()]"
    copy-namespaces="no"/>
</xsl:template>

The complete stylesheet is at https://gist.github.com/gimsieke/529dab000386a45d6136e850a80ac726
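
The insert-marker mode is not reproduced in this message. As a minimal sketch only, assuming it is an identity copy that emits an empty <eos/> element after each sentence-final . or ? (including any trailing whitespace), it could look like the following. The actual templates in the gist may differ, and the default-mode templates are added here just to make the sketch runnable on its own:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

  <!-- Default mode (assumption, for demonstration only): copy the document -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Send each paragraph through the marker pass so the result can be inspected -->
  <xsl:template match="p">
    <xsl:apply-templates select="." mode="insert-marker"/>
  </xsl:template>

  <!-- insert-marker mode: identity copy at any depth -->
  <xsl:template match="@* | node()" mode="insert-marker">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text inside a paragraph: emit <eos/> after . or ? when it is followed
       by whitespace or by the end of the text node -->
  <xsl:template match="p//text()" mode="insert-marker">
    <xsl:analyze-string select="." regex="[.?](\s+|$)">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
        <eos/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Note that this sketch treats every text node ending in . or ? as a sentence end, so abbreviations and punctuation at the end of inline elements would need extra care. A marker at the very end of a paragraph would presumably produce an empty trailing chunk, which the p[node()] filter in the copy-of above discards.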

Applying it to your input, David, will yield:

<?xml version="1.0" encoding="UTF-8"?><root>
<p>This has one <span class="zzz">sentence? </span></p><p><span class="zzz">Actually, it has <emphasis>two</emphasis>. </span></p><p><span class="zzz">No,</span> it has three.</p>
</root>

Gerrit


On 24.11.2019 15:32, David Carlisle 
d(_dot_)p(_dot_)carlisle(_at_)gmail(_dot_)com wrote:
Can we assume the easy case (as in your example), where all the
sentences end at the top level?
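
For that easy case, grouping the top-level children of the paragraph is enough. A minimal sketch, assuming sentence ends occur only in text nodes that are direct children of p; the mode name mark and the use of an empty eos element as marker are illustrative choices:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

  <xsl:template match="/root">
    <xsl:copy>
      <xsl:apply-templates select="p"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="p">
    <!-- Pass 1: copy the children into a temporary tree, adding <eos/>
         after each sentence end found in a top-level text node -->
    <xsl:variable name="marked">
      <xsl:apply-templates mode="mark"/>
    </xsl:variable>
    <!-- Pass 2: one <p> per run of nodes that ends with <eos/> -->
    <xsl:for-each-group select="$marked/node()" group-ending-with="eos">
      <p>
        <xsl:copy-of select="current-group()[not(self::eos)]"/>
      </p>
    </xsl:for-each-group>
  </xsl:template>

  <!-- Child elements such as <emphasis> are copied unchanged -->
  <xsl:template match="p/*" mode="mark">
    <xsl:copy-of select="."/>
  </xsl:template>

  <!-- Top-level text: mark . or ? followed by whitespace or end of text -->
  <xsl:template match="p/text()" mode="mark">
    <xsl:analyze-string select="." regex="[.?](\s+|$)">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
        <eos/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Applied to the input document from the original question, this gives one p per sentence, with the emphasis element preserved in the second one. It does not handle a sentence that ends inside a child element.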

A more challenging example is

<root>
     <p>This has one <span class="zzz">sentence? Actually, it has
<emphasis>two</emphasis>.  No,</span> it has three.</p>
</root>

as then you need to force-close any open elements at the sentence end
and re-open them in the new sentence so something like

   <p>This has one <span class="zzz">sentence?</span></p>
   <p><span class="zzz">Actually, it has <emphasis>two</emphasis>.</span></p>
   <p><span class="zzz">No,</span> it has three.</p>
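
One way to get this force-close and re-open behaviour is to project each chunk back onto the paragraph: for every run of leaf nodes up to a marker, copy the paragraph but keep only the nodes that belong to the run or contain part of it. The following is a minimal sketch, assuming the paragraph already contains an empty <eos/> marker after each sentence; the mode name project and the parameter name keep are made up for illustration, and this is not the code of any particular library:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

  <!-- Copy everything that is not handled explicitly -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Split a paragraph that contains <eos/> markers at any depth -->
  <xsl:template match="p[.//eos]">
    <xsl:variable name="p" select="."/>
    <!-- Leaves are nodes without children: text nodes, <eos/>, and
         other empty elements. Each group ends with a marker. -->
    <xsl:for-each-group select="descendant::node()[not(node())]"
      group-ending-with="eos">
      <xsl:apply-templates select="$p" mode="project">
        <xsl:with-param name="keep" tunnel="yes"
          select="current-group()/ancestor-or-self::node()"/>
      </xsl:apply-templates>
    </xsl:for-each-group>
  </xsl:template>

  <!-- Copy a node only if it is part of the current chunk or an
       ancestor of a node in the chunk -->
  <xsl:template match="node()" mode="project">
    <xsl:param name="keep" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $keep)">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Drop the markers themselves -->
  <xsl:template match="eos" mode="project"/>

</xsl:stylesheet>
```

Every element that spans a marker, such as the span here, is copied once per chunk with its attributes, which closes it at the sentence end and re-opens it in the next paragraph. Two adjacent markers would produce an empty p, so a real stylesheet would probably want to filter those out.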

David

On Sun, 24 Nov 2019 at 13:34, Rick Quatro rick(_at_)rickquatro(_dot_)com
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

Hi All,



I have a situation where I want to split a short paragraph into sentences and use them in
different parts of my output. I am using <xsl:analyze-string> because I want to
account for a sentence ending with a . or ?. This will work except if there are any
children of the paragraph, like the <emphasis> in the second sentence. Can I split a
paragraph into sentences and still keep the markup?



Here is my input document:



<?xml version="1.0" encoding="UTF-8"?>
<root>
     <p>This has one sentence? Actually, it has <emphasis>two</emphasis>. No, it has three.</p>
</root>



My stylesheet:



<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:xs="http://www.w3.org/2001/XMLSchema"
     xmlns:rq="http://www.frameexpert.com"
     exclude-result-prefixes="xs rq"
     version="2.0">

     <xsl:output indent="yes"/>
     <xsl:strip-space elements="root"/>

     <xsl:template match="/root">
         <xsl:copy>
             <xsl:apply-templates/>
         </xsl:copy>
     </xsl:template>

     <xsl:template match="p">
         <xsl:variable name="sentences" select="rq:splitParagraphIntoSentences(.)"/>
         <p><xsl:value-of select="$sentences[1]"/></p>
         <note>Something in between.</note>
         <p><xsl:value-of select="$sentences[position()&gt;1]"/></p>
     </xsl:template>

     <xsl:function name="rq:splitParagraphIntoSentences">
         <xsl:param name="paragraph"/>
         <xsl:analyze-string select="$paragraph" regex=".+?[\.\?](\s+|$)">
             <xsl:matching-substring>
                 <sentence><xsl:value-of select="replace(.,'\s+$','')"/></sentence>
             </xsl:matching-substring>
         </xsl:analyze-string>
     </xsl:function>
</xsl:stylesheet>



My output:



<?xml version="1.0" encoding="UTF-8"?>
<root>
    <p>This has one sentence?</p>
    <note>Something in between.</note>
    <p>Actually, it has two. No, it has three.</p>
</root>



What I want is this:



<?xml version="1.0" encoding="UTF-8"?>
<root>
    <p>This has one sentence? </p>
    <note>Something in between.</note>
    <p>Actually, it has <emphasis>two</emphasis>. No, it has three. </p>
</root>



Any suggestions will be appreciated.



Rick
