xsl-list

Re: [xsl] Splitting text nodes - xsl:iterate?

2014-11-12 12:18:57
Tom Cleghorn tcleghorn(_at_)cambridge(_dot_)org wrote:

Given an input document looking something like this:
<doc>
   <head><foo/><bar/><baz/></head>
   <body>
     <sec>
       <para>Lorem ipsum dolor sit amet, consectetur adipiscing
elit.<box outline="maybe"><para quack="y">Proin id <?foo bar?>bibendum
urna, <baz>ut ornare</baz> mi.</para></box></para>
       <para>Aenean dui risus, <qux>sodales quis leo sit amet, ornare
consequat</qux> metus. Ut vel massa congue, egestas nibh et, rutrum
odio.</para>
     </sec>
   </body>
</doc>

(i.e. document markup consisting of arbitrary text and element nodes
nested to some unknown depth)

and the requirement for two separate outputs looking like these:
<doc>
   <head><foo/><bar/><baz/></head>
   <body>
     <sec>
       <para><new:start/>Lorem ipsum dolor sit amet, consectetur
adipiscing elit.<box outline="maybe"><para quack="y">Proin id <?foo
bar?>bibendum urna, <baz>ut ornare</baz> mi.</para></box></para>
       <para>Aenean dui risus, <qux>sodales quis <new:end/>leo sit amet,
ornare consequat</qux> metus. Ut vel massa congue, egestas nibh et,
rutrum odio.</para>
     </sec>
   </body>
</doc>

<sec>
   <para>Lorem ipsum dolor sit amet, consectetur adipiscing elit.<box
outline="maybe"><para quack="y">Proin id <?foo bar?>bibendum urna,
<baz>ut ornare</baz> mi.</para></box></para>
   <para>Aenean dui risus, <qux>sodales quis [...]</qux></para>
</sec>

(i.e. a copy of the input, with new:start and new:end elements marking
the first 20 words of the document; and separately a copy of those first
twenty words, preserving all markup within them and adding ellipses at
the end)

I tried the following with Saxon 9.6 PE:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:xf="http://www.w3.org/2005/xpath-functions"
  xmlns:new="http://example.com/new"
  exclude-result-prefixes="xs xf">

<xsl:param name="size" as="xs:integer" select="20"/>

<xsl:variable name="regex" as="xs:string" select="concat('^(\w+[\s\p{P}]+){', $size, '}')"/>

<xsl:param name="file-name" as="xs:string" select="'test2014111202Text.xml'"/>

<xsl:variable name="start-node" as="text()?" select="descendant::text()[normalize-space()][1]"/>

<xsl:variable name="end-node" as="text()?"
select="descendant::text()[normalize-space() and matches(string-join((preceding::text()[normalize-space()], .), ''), $regex)][1]"/>

<xsl:variable name="end-words" as="xs:string?"
select="replace(string-join(($end-node/preceding::text()[normalize-space()], $end-node), ''), $regex, '')"/>

<xsl:template match="/">

  <xsl:variable name="d1">
    <xsl:apply-templates/>
  </xsl:variable>

  <xsl:copy-of select="$d1"/>

  <xsl:result-document href="{$file-name}">
    <xsl:variable name="split" select="$d1//new:end"/>
    <xsl:variable name="copy" select="$split/(ancestor-or-self::node() | preceding::node())"/>
    <xsl:apply-templates select="($copy//sec)[1]" mode="sep">
      <xsl:with-param name="nodes" select="$copy" tunnel="yes"/>
    </xsl:apply-templates>
  </xsl:result-document>

</xsl:template>

<xsl:template match="node()" mode="sep">
  <xsl:param name="nodes" tunnel="yes"/>
  <xsl:if test=". intersect $nodes">
    <xsl:copy>
      <xsl:apply-templates select="@* , node()" mode="sep"/>
    </xsl:copy>
  </xsl:if>
</xsl:template>

<xsl:template match="new:start" mode="sep"/>

<xsl:template match="new:end" mode="sep">
  <xsl:text>[...]</xsl:text>
</xsl:template>

<xsl:template match="@* | node()">
  <xsl:copy>
    <xsl:apply-templates select="@* , node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="$start-node" priority="5">
  <new:start/>
  <!-- I would like to use
  <xsl:next-match/>
       here, so that either the identity transformation template applies
       (if $start-node and $end-node are different nodes) or the template
       below applies (if they are the same node),
       but I ran into a problem with Saxon 9.6 PE
  -->
  <xsl:value-of select="."/>
</xsl:template>

<xsl:template match="$end-node">
  <xsl:value-of select="substring-before(., $end-words)"/>
  <new:end/>
  <xsl:value-of select="$end-words"/>
</xsl:template>

</xsl:stylesheet>


I think it produces the output you want for the input you posted, but I have not tried it on other samples. Obviously, part of the approach is writing a regular expression that identifies the "words". I used

<xsl:variable name="regex" as="xs:string" select="concat('^(\w+[\s\p{P}]+){', $size, '}')"/>

which works on your sample but would fail, for instance, if the first text node containing words starts with white space or punctuation characters.
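
As a minimal sketch of one way around that limitation, the pattern could skip optional leading white space and punctuation before it starts counting words. This is only an illustration: the variable name lenient-regex and the sample string are made up here, the size is reduced to 3 to keep the sample short, and I have not tried the change inside the full stylesheet above. It can be run against any input document:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs">

<xsl:output method="text"/>

<!-- reduced from 20 to 3 for this illustration -->
<xsl:param name="size" as="xs:integer" select="3"/>

<!-- the pattern used in the stylesheet above -->
<xsl:variable name="regex" as="xs:string" select="concat('^(\w+[\s\p{P}]+){', $size, '}')"/>

<!-- hypothetical variant: allow leading white space or punctuation before the first word -->
<xsl:variable name="lenient-regex" as="xs:string" select="concat('^[\s\p{P}]*(\w+[\s\p{P}]+){', $size, '}')"/>

<xsl:template match="/">
  <!-- made-up sample text that starts with a space and an opening parenthesis -->
  <xsl:variable name="sample" as="xs:string" select="' (Lorem ipsum) dolor sit amet'"/>

  <!-- false: the first character is not a word character -->
  <xsl:value-of select="matches($sample, $regex)"/>
  <xsl:text>&#10;</xsl:text>

  <!-- true: the leading ' (' is skipped, then three words are counted -->
  <xsl:value-of select="matches($sample, $lenient-regex)"/>
  <xsl:text>&#10;</xsl:text>

  <!-- what is left after the first three words: 'sit amet' -->
  <xsl:value-of select="replace($sample, $lenient-regex, '')"/>
</xsl:template>

</xsl:stylesheet>
```

Both patterns still require at least one white space or punctuation character after the last counted word, so a text that ends exactly on that word would not match either of them.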