Re: [xsl] Splitting a paragraph into sentences and keep markup

2019-11-24 11:50:47
Hi David,

Yes, there shouldn't be any cross-paragraph elements.

Rick

-----Original Message-----
From: David Carlisle d(_dot_)p(_dot_)carlisle(_at_)gmail(_dot_)com 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> 
Sent: Sunday, November 24, 2019 9:33 AM
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] Splitting a paragraph into sentences and keep markup

Can we assume the easy case (as in your example), where all the sentences end 
at the top level?

A more challenging example is

<root>
    <p>This has one <span class="zzz">sentence? Actually, it has <emphasis>two</emphasis>.  No,</span> it has three.</p>
</root>

because then you need to force-close any open elements at the end of each 
sentence and re-open them in the next one, giving something like

  <p>This has one <span class="zzz">sentence?</span></p>
  <p><span class="zzz">Actually, it has <emphasis>two</emphasis>.</span></p>
  <p><span class="zzz">No,</span> it has three.</p>
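
One possible way to sketch the force-close and re-open idea in XSLT 2.0, 
offered as an illustration rather than a definitive answer: first copy the 
paragraph while dropping an empty marker element after every sentence end, 
then rebuild the paragraph once per sentence, keeping only the text that has 
the right number of markers before it. The marker name rq:end and the mode 
names "mark" and "slice" are made up for this sketch, and the sentence test 
(a . or ? followed by whitespace or the end of a text node) is an assumption 
that will misfire on text such as "fig.<b>1</b>".

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rq="http://www.frameexpert.com"
    exclude-result-prefixes="rq"
    version="2.0">

    <xsl:strip-space elements="root"/>

    <xsl:template match="/root">
        <xsl:copy>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="p">
        <!-- Pass 1: copy the content, adding an rq:end marker after each sentence end. -->
        <xsl:variable name="marked">
            <xsl:apply-templates mode="mark"/>
        </xsl:variable>
        <!-- Pass 2: sentence number n is the content with exactly n markers before it. -->
        <xsl:for-each select="0 to count($marked//rq:end)">
            <xsl:variable name="slice">
                <xsl:apply-templates select="$marked/node()" mode="slice">
                    <xsl:with-param name="n" select="." tunnel="yes"/>
                </xsl:apply-templates>
            </xsl:variable>
            <!-- Skip slices with no text, e.g. the one after the final marker. -->
            <xsl:if test="normalize-space($slice)">
                <p><xsl:copy-of select="$slice/node()"/></p>
            </xsl:if>
        </xsl:for-each>
    </xsl:template>

    <!-- mark mode: copy elements and attributes unchanged. -->
    <xsl:template match="*" mode="mark">
        <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates mode="mark"/>
        </xsl:copy>
    </xsl:template>

    <!-- mark mode: keep the terminator, drop the whitespace after it, add a marker. -->
    <xsl:template match="text()" mode="mark">
        <xsl:analyze-string select="." regex="[.?](\s+|$)">
            <xsl:matching-substring>
                <xsl:value-of select="substring(., 1, 1)"/>
                <rq:end/>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

    <!-- slice mode: keep an element only if it holds text of sentence n
         (or, if it has no text at all, if it sits inside sentence n). -->
    <xsl:template match="*" mode="slice">
        <xsl:param name="n" tunnel="yes"/>
        <xsl:if test=".//text()[count(preceding::rq:end) = $n]
                      or (empty(.//text()) and count(preceding::rq:end) = $n)">
            <xsl:copy>
                <xsl:copy-of select="@*"/>
                <xsl:apply-templates mode="slice"/>
            </xsl:copy>
        </xsl:if>
    </xsl:template>

    <!-- slice mode: the markers themselves never reach the output. -->
    <xsl:template match="rq:end" mode="slice"/>

    <!-- slice mode: keep a text node only if it belongs to sentence n. -->
    <xsl:template match="text()" mode="slice">
        <xsl:param name="n" tunnel="yes"/>
        <xsl:if test="count(preceding::rq:end) = $n">
            <xsl:value-of select="."/>
        </xsl:if>
    </xsl:template>

</xsl:stylesheet>
```

Because an element is copied into every slice in which it has content, a 
<span> that straddles a sentence boundary is closed in one <p> and re-opened 
in the next, which is the shape of the three paragraphs above. The same 
stylesheet also covers the easy case, since top-level sentence ends are just 
markers that happen to have no open elements around them. Counting 
preceding::rq:end for every node is quadratic, which should be acceptable for 
short paragraphs but not for long ones.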

David

On Sun, 24 Nov 2019 at 13:34, Rick Quatro rick(_at_)rickquatro(_dot_)com 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

Hi All,



I have a situation where I want to split a short paragraph into sentences and 
use them in different parts of my output. I am using <xsl:analyze-string> 
because I want to account for a sentence ending with a . or ?. This works 
unless the paragraph has child elements, like the <emphasis> in the second 
sentence, in which case the markup is lost. Can I split a paragraph into 
sentences and still keep the markup?



Here is my input document:



<?xml version="1.0" encoding="UTF-8"?>
<root>
    <p>This has one sentence? Actually, it has <emphasis>two</emphasis>. No, it has three.</p>
</root>



My stylesheet:



<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:rq="http://www.frameexpert.com"
    exclude-result-prefixes="xs rq"
    version="2.0">

    <xsl:output indent="yes"/>
    <xsl:strip-space elements="root"/>

    <xsl:template match="/root">
        <xsl:copy>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="p">
        <xsl:variable name="sentences" select="rq:splitParagraphIntoSentences(.)"/>
        <p><xsl:value-of select="$sentences[1]"/></p>
        <note>Something in between.</note>
        <p><xsl:value-of select="$sentences[position()&gt;1]"/></p>
    </xsl:template>

    <xsl:function name="rq:splitParagraphIntoSentences">
        <xsl:param name="paragraph"/>
        <xsl:analyze-string select="$paragraph" regex=".+?[\.\?](\s+|$)">
            <xsl:matching-substring>
                <sentence><xsl:value-of select="replace(.,'\s+$','')"/></sentence>
            </xsl:matching-substring>
        </xsl:analyze-string>
    </xsl:function>

</xsl:stylesheet>



My output:



<?xml version="1.0" encoding="UTF-8"?>
<root>
   <p>This has one sentence?</p>
   <note>Something in between.</note>
   <p>Actually, it has two. No, it has three.</p>
</root>



What I want is this:



<?xml version="1.0" encoding="UTF-8"?>
<root>
   <p>This has one sentence? </p>
   <note>Something in between.</note>
   <p>Actually, it has <emphasis>two</emphasis>. No, it has three. </p>
</root>



Any suggestions will be appreciated.



Rick

--~----------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
EasyUnsubscribe: http://lists.mulberrytech.com/unsub/xsl-list/1167547
or by email: xsl-list-unsub(_at_)lists(_dot_)mulberrytech(_dot_)com
--~--