xsl-list

Re: [xsl] Splitting a paragraph into sentences and keep markup

2019-11-24 13:38:34
I think there are two basic approaches to this kind of problem. One is to 
convert the punctuation into tags, and then manipulate the resulting tree 
structure; the other is to turn the embedded tags into punctuation (like 
"[emphasis]two[/emphasis]") and then manipulate the content as a character 
string. My instinct, like Martin Honnen's, is to do the first.
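
A minimal sketch of the first approach, written against the input document quoted below. It is only an illustration, not a tested solution: the <eos/> marker and the <sentence> wrapper are hypothetical names used inside temporary trees, and only text nodes that are direct children of <p> are examined.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

    <xsl:strip-space elements="root"/>

    <xsl:template match="/root">
        <xsl:copy>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="p">
        <!-- Pass 1: copy the children of p into a temporary tree, adding a
             hypothetical empty <eos/> marker after each sentence end. -->
        <xsl:variable name="marked">
            <xsl:apply-templates select="node()" mode="mark"/>
        </xsl:variable>
        <!-- Pass 2: each run of nodes up to and including a marker is one sentence. -->
        <xsl:variable name="sentences">
            <xsl:for-each-group select="$marked/node()" group-ending-with="eos">
                <sentence>
                    <xsl:copy-of select="current-group()[not(self::eos)]"/>
                </sentence>
            </xsl:for-each-group>
        </xsl:variable>
        <!-- copy-of (rather than value-of) keeps the inline markup. -->
        <p><xsl:copy-of select="$sentences/sentence[1]/node()"/></p>
        <note>Something in between.</note>
        <p><xsl:copy-of select="$sentences/sentence[position() gt 1]/node()"/></p>
    </xsl:template>

    <!-- Text directly inside p: a sentence ends at [.?!] followed by
         whitespace or by the end of the text node. -->
    <xsl:template match="text()" mode="mark">
        <xsl:analyze-string select="." regex="[.?!](\s+|$)">
            <xsl:matching-substring>
                <xsl:value-of select="."/>
                <eos/>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>

    <!-- Any other child of p (such as emphasis) is copied whole.
         The explicit priority keeps this rule below the text() rule. -->
    <xsl:template match="node()" mode="mark" priority="-1">
        <xsl:copy-of select="."/>
    </xsl:template>
</xsl:stylesheet>
```

With the sample input this should put the first sentence in the first <p> and the remaining two, with <emphasis> intact, in the second. Punctuation inside an inline element is deliberately not treated as a sentence end here, since the marker would then sit at the wrong level for grouping.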

There are still complications, of course. For example if you're detecting 
end-of-sentence as [.?!] followed by a space or end-of-paragraph, then it's 
challenging to handle the case where the [.?!] is the last character in a text 
node but the text node isn't the last thing in the paragraph. (For example 
"sentence.<footnote>x</footnote> "). There's no easy answer to this (and 
natural language being what it is, there is no right answer either).
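
One possible policy for that boundary case, sketched as a standalone marking pass: a [.?!] that ends a text node counts as a sentence end only when nothing follows that text node in the paragraph. This is an assumption about what is wanted, not a general answer, and the <eos/> marker is again a hypothetical name.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

    <!-- Identity copy for everything except text directly inside p. -->
    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="p/text()">
        <!-- True when this text node is the last thing in the paragraph.
             Computed here because the context item inside
             xsl:analyze-string is a string, not the node. -->
        <xsl:variable name="is-last" select="empty(following-sibling::node())"/>
        <xsl:analyze-string select="." regex="[.?!](\s+|$)">
            <xsl:matching-substring>
                <xsl:value-of select="."/>
                <!-- A match without trailing whitespace came from the
                     end-of-string branch, so accept it only at the very
                     end of the paragraph. -->
                <xsl:if test="$is-last or matches(., '\s$')">
                    <eos/>
                </xsl:if>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>
    </xsl:template>
</xsl:stylesheet>
```

For "sentence.<footnote>x</footnote> " this avoids placing the boundary before the footnote, but it does not detect a boundary after it either, so the sentence simply runs on into whatever follows. Whether that is acceptable depends on the content.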

Michael Kay
Saxonica

On 24 Nov 2019, at 13:34, Rick Quatro rick(_at_)rickquatro(_dot_)com 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

Hi All,
 
I have a situation where I want to split a short paragraph into sentences and 
use them in different parts of my output. I am using <xsl:analyze-string> 
because I want to account for a sentence ending with a . or ?. This will work 
except if there are any children of the paragraph, like the <emphasis> in the 
second sentence. Can I split a paragraph into sentences and still keep the 
markup?
 
Here is my input document:
 
<?xml version="1.0" encoding="UTF-8"?>
<root>
    <p>This has one sentence? Actually, it has <emphasis>two</emphasis>. No, it has three.</p>
</root>
 
My stylesheet:
 
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:rq="http://www.frameexpert.com"
    exclude-result-prefixes="xs rq"
    version="2.0">
    
    <xsl:output indent="yes"/>
    <xsl:strip-space elements="root"/>
    
    <xsl:template match="/root">
        <xsl:copy>
            <xsl:apply-templates/>
        </xsl:copy>
    </xsl:template>
    
    <xsl:template match="p">
        <xsl:variable name="sentences"
            select="rq:splitParagraphIntoSentences(.)"/>
        <p><xsl:value-of select="$sentences[1]"/></p>
        <note>Something in between.</note>
        <p><xsl:value-of select="$sentences[position()&gt;1]"/></p>
    </xsl:template>
    
    <xsl:function name="rq:splitParagraphIntoSentences">
        <xsl:param name="paragraph"/>
        <xsl:analyze-string select="$paragraph" regex=".+?[\.\?](\s+|$)">
            <xsl:matching-substring>
                <sentence><xsl:value-of select="replace(.,'\s+$','')"/></sentence>
            </xsl:matching-substring>
        </xsl:analyze-string>
    </xsl:function>
</xsl:stylesheet>
 
My output:
 
<?xml version="1.0" encoding="UTF-8"?>
<root>
   <p>This has one sentence?</p>
   <note>Something in between.</note>
   <p>Actually, it has two. No, it has three.</p>
</root>
 
What I want is this:
 
<?xml version="1.0" encoding="UTF-8"?>
<root>
   <p>This has one sentence? </p>
   <note>Something in between.</note>
   <p>Actually, it has <emphasis>two</emphasis>. No, it has three. </p>
</root>
 
Any suggestions will be appreciated.
 
Rick
XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
EasyUnsubscribe <http://lists.mulberrytech.com/unsub/xsl-list/293509>