Tom Cleghorn tcleghorn(_at_)cambridge(_dot_)org wrote:
Given an input document looking something like this:
<doc>
<head><foo/><bar/><baz/></head>
<body>
<sec>
<para>Lorem ipsum dolor sit amet, consectetur adipiscing
elit.<box outline="maybe"><para quack="y">Proin id <?foo bar?>bibendum
urna, <baz>ut ornare</baz> mi.</para></box></para>
<para>Aenean dui risus, <qux>sodales quis leo sit amet, ornare
consequat</qux> metus. Ut vel massa congue, egestas nibh et, rutrum
odio.</para>
</sec>
</body>
</doc>
(i.e. document markup consisting of arbitrary text and element nodes
nested to some unknown depth)
and the requirement for two separate outputs looking like these:
<doc>
<head><foo/><bar/><baz/></head>
<body>
<sec>
<para><new:start/>Lorem ipsum dolor sit amet, consectetur
adipiscing elit.<box outline="maybe"><para quack="y">Proin id <?foo
bar?>bibendum urna, <baz>ut ornare</baz> mi.</para></box></para>
<para>Aenean dui risus, <qux>sodales quis <new:end/>leo sit amet,
ornare consequat</qux> metus. Ut vel massa congue, egestas nibh et,
rutrum odio.</para>
</sec>
</body>
</doc>
<sec>
<para>Lorem ipsum dolor sit amet, consectetur adipiscing elit.<box
outline="maybe"><para quack="y">Proin id <?foo bar?>bibendum urna,
<baz>ut ornare</baz> mi.</para></box></para>
<para>Aenean dui risus, <qux>sodales quis [...]</qux></para>
</sec>
(i.e. a copy of the input, with new:start and new:end elements marking
the first 20 words of the document; and separately a copy of those first
twenty words, preserving all markup within them and adding ellipses at
the end)
I tried the following with Saxon 9.6 PE:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xf="http://www.w3.org/2005/xpath-functions"
xmlns:new="http://example.com/new"
exclude-result-prefixes="xs xf">
<xsl:param name="size" as="xs:integer" select="20"/>
<xsl:variable name="regex" as="xs:string"
select="concat('^(\w+[\s\p{P}]+){', $size, '}')"/>
<xsl:param name="file-name" as="xs:string"
select="'test2014111202Text.xml'"/>
<xsl:variable name="start-node" as="text()?"
select="descendant::text()[normalize-space()][1]"/>
<xsl:variable name="end-node" as="text()?"
select="descendant::text()[normalize-space() and
matches(string-join((preceding::text()[normalize-space()], .), ''),
$regex)][1]"/>
<xsl:variable name="end-words" as="xs:string?"
select="replace(string-join(($end-node/preceding::text()[normalize-space()],
$end-node), ''), $regex, '')"/>
<xsl:template match="/">
<xsl:variable name="d1">
<xsl:apply-templates/>
</xsl:variable>
<xsl:copy-of select="$d1"/>
<xsl:result-document href="{$file-name}">
<xsl:variable name="split" select="$d1//new:end"/>
<xsl:variable name="copy" select="$split/(ancestor-or-self::node()
| preceding::node())"/>
<xsl:apply-templates select="($copy//sec)[1]" mode="sep">
<xsl:with-param name="nodes" select="$copy" tunnel="yes"/>
</xsl:apply-templates>
</xsl:result-document>
</xsl:template>
<xsl:template match="node()" mode="sep">
<xsl:param name="nodes" tunnel="yes"/>
<xsl:if test=". intersect $nodes">
<xsl:copy>
<xsl:apply-templates select="@* , node()" mode="sep"/>
</xsl:copy>
</xsl:if>
</xsl:template>
<xsl:template match="new:start" mode="sep"/>
<xsl:template match="new:end" mode="sep">
<xsl:text>[...]</xsl:text>
</xsl:template>
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* , node()"/>
</xsl:copy>
</xsl:template>
<xsl:template match="$start-node" priority="5">
<new:start/>
<!-- would like to use
<xsl:next-match/>
here, so that the identity transformation template applies if $start-node
and $end-node are different,
or the template below applies if they are the same node,
but ran into a problem with Saxon 9.6 PE
-->
<xsl:value-of select="."/>
</xsl:template>
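<xsl:template match="$end-node">
<!-- take the text before $end-words by length; substring-before()
would return '' when $end-words is empty and would stop at an
earlier occurrence if the same text is repeated in this node -->
<xsl:value-of select="substring(., 1, string-length(.) - string-length($end-words))"/>
<new:end/>
<xsl:value-of select="$end-words"/>
</xsl:template>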
</xsl:stylesheet>
I think it produces the output you want for the input you posted, but I
have not tried it on other samples. Obviously part of the approach is
writing a regular expression that identifies the "words"; I used
<xsl:variable name="regex" as="xs:string"
select="concat('^(\w+[\s\p{P}]+){', $size, '}')"/>
which works on your sample but would fail, for instance, if the first text
node containing words starts with white space or punctuation characters.
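An untested sketch of one way to tolerate that case: let the expression
skip any leading white space or punctuation before it starts counting
words. The only change is the added [\s\p{P}]* after the ^; everything
else is as above:
<xsl:variable name="regex" as="xs:string"
select="concat('^[\s\p{P}]*(\w+[\s\p{P}]+){', $size, '}')"/>
The rest of the stylesheet should not need to change, since $end-words is
still whatever follows the matched prefix. Characters between words that
are neither \w, \s nor \p{P} (a non-breaking space, for example) would
still prevent a match.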