
Re: [xsl] Splitting text nodes - xsl:iterate?

2014-11-12 12:18:57
Tom Cleghorn tcleghorn(_at_)cambridge(_dot_)org wrote:

Given an input document looking something like this:
       <para>Lorem ipsum dolor sit amet, consectetur adipiscing
elit.<box outline="maybe"><para quack="y">Proin id <?foo bar?>bibendum
urna, <baz>ut ornare</baz> mi.</para></box></para>
       <para>Aenean dui risus, <qux>sodales quis leo sit amet, ornare
consequat</qux> metus. Ut vel massa congue, egestas nibh et, rutrum
odio.</para>

(i.e. document markup consisting of arbitrary text and element nodes
nested to some unknown depth)

and the requirement for two separate outputs looking like these:
       <para><new:start/>Lorem ipsum dolor sit amet, consectetur
adipiscing elit.<box outline="maybe"><para quack="y">Proin id <?foo
bar?>bibendum urna, <baz>ut ornare</baz> mi.</para></box></para>
       <para>Aenean dui risus, <qux>sodales quis <new:end/>leo sit amet,
ornare consequat</qux> metus. Ut vel massa congue, egestas nibh et,
rutrum odio.</para>

   <para>Lorem ipsum dolor sit amet, consectetur adipiscing elit.<box
outline="maybe"><para quack="y">Proin id <?foo bar?>bibendum urna,
<baz>ut ornare</baz> mi.</para></box></para>
   <para>Aenean dui risus, <qux>sodales quis [...]</qux></para>

(i.e. a copy of the input, with new:start and new:end elements marking
the first 20 words of the document; and separately a copy of those first
twenty words, preserving all markup within them and adding ellipses at
the end)

I tried the following with Saxon 9.6 PE:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  exclude-result-prefixes="xs xf">

<xsl:param name="size" as="xs:integer" select="20"/>

<xsl:variable name="regex" as="xs:string" select="concat('^(\w+[\s\p{P}]+){', $size, '}')"/>

<xsl:param name="file-name" as="xs:string" select="'test2014111202Text.xml'"/>

<xsl:variable name="start-node" as="text()?" select="descendant::text()[normalize-space()][1]"/>

<xsl:variable name="end-node" as="text()?"
select="descendant::text()[normalize-space() and matches(string-join((preceding::text()[normalize-space()], .), ''), $regex)][1]"/>

<xsl:variable name="end-words" as="xs:string?"
select="replace(string-join(($end-node/preceding::text()[normalize-space()], $end-node), ''), $regex, '')"/>

<xsl:template match="/">

  <xsl:variable name="d1">

  <xsl:copy-of select="$d1"/>

  <xsl:result-document href="{$file-name}">
    <xsl:variable name="split" select="$d1//new:end"/>
    <xsl:variable name="copy" select="$split/(ancestor-or-self::node() | preceding::node())"/>
    <xsl:apply-templates select="($copy//sec)[1]" mode="sep">
      <xsl:with-param name="nodes" select="$copy" tunnel="yes"/>


<xsl:template match="node()" mode="sep">
  <xsl:param name="nodes" tunnel="yes"/>
  <xsl:if test=". intersect $nodes">
      <xsl:apply-templates select="@* , node()" mode="sep"/>

<xsl:template match="new:start" mode="sep"/>

<xsl:template match="new:end" mode="sep">

<xsl:template match="@* | node()">
    <xsl:apply-templates select="@* , node()"/>

<xsl:template match="$start-node" priority="5">
  <!-- would like
       to either use the identity transformation template if $start-node and $end-node are different
       or the template below if they are the same
       but ran into a problem with Saxon 9.6 PE -->
  <xsl:value-of select="."/>

<xsl:template match="$end-node">
  <xsl:value-of select="substring-before(., $end-words)"/>
  <xsl:value-of select="$end-words"/>


I think it produces the output you want for the input you posted, but I have not tried it on other samples. Obviously, part of the approach is writing a regular expression that identifies the "words"; I used

<xsl:variable name="regex" as="xs:string" select="concat('^(\w+[\s\p{P}]+){', $size, '}')"/>

which works on your sample but would fail, for instance, if the first text node containing words starts with white space or punctuation characters.
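
One possible way around that limitation, as a minimal sketch that I have only reasoned through on a toy string: let the pattern skip optional leading white space and punctuation before it starts counting words. The variable name lenient-regex, the sample string and the reduced size of 3 are assumptions made for illustration; they are not part of the stylesheet above.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <!-- small word count so the toy string below is long enough -->
  <xsl:param name="size" as="xs:integer" select="3"/>

  <!-- pattern as used in the post: must start directly with a word -->
  <xsl:variable name="regex" as="xs:string"
    select="concat('^(\w+[\s\p{P}]+){', $size, '}')"/>

  <!-- hypothetical variant: tolerate leading white space or punctuation -->
  <xsl:variable name="lenient-regex" as="xs:string"
    select="concat('^[\s\p{P}]*(\w+[\s\p{P}]+){', $size, '}')"/>

  <!-- assumed sample text starting with a space and an opening parenthesis -->
  <xsl:variable name="sample" as="xs:string"
    select="' (Lorem ipsum dolor sit amet'"/>

  <!-- the input document is ignored; any well-formed XML file will do -->
  <xsl:template match="/">
    <xsl:text>original: </xsl:text>
    <xsl:value-of select="matches($sample, $regex)"/>
    <xsl:text>&#10;lenient: </xsl:text>
    <xsl:value-of select="matches($sample, $lenient-regex)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Run against any XML input, this should report false for the original pattern and true for the lenient one. Whether the skipped leading characters should count as part of the first "word" when computing $end-words is a separate decision that this sketch does not address.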