Re: [xsl] How to split text element to separate spans?

Liam and Gerrit: Thank you very much for your input, ideas and explanations.
I have many things to catch up on in XSLT before I can understand this
code, but I'll try.
Thanks again, Israel


On Tue, Jun 8, 2010 at 2:28 AM, Imsieke, Gerrit, le-tex
<gerrit(_dot_)imsieke(_at_)le-tex(_dot_)de> wrote:

Dear Israel,

I once wrote a generic splitting routine where you can split at arbitrary
XPath expressions, at any depth. It uses saxon:evaluate, though, and is too
complicated to be instructive here. So I tried to simplify it, below.

Let's consider this input:

=========8<-------------------

<?xml version="1.0" encoding="utf-8"?>
<doc>
<p dir="ltr"><span class="smaller">text1
           <br />
            text2
           text3.
           <br />
           </span> <span class="smalleritalic">no</span> <span
class="smaller">problems.
           <br />


           <br /></span></p>

<p dir="ltr"><br/><span class="smaller">text1
           <br />
            <span class="reallytiny">text2 <br /></span>
           text3.
           <br />
           </span> <span class="smalleritalic">no</span> <span
class="smaller">problems.
           <br />


           <br /></span></p>

<p dir="ltr">  <span class="regular">"What else?"</span></p>
</doc>

=========8<-------------------

The first p contains your original input, the second p contains a br within
*nested* spans (and a br immediately below p), and the third one doesn't
contain a br.

Applying the stylesheet quoted below, we'll arrive at this output:

=========8<-------------------

<?xml version="1.0" encoding="UTF-8"?><doc>
<p dir="ltr"><span class="smaller">text1
           </span><br/><span class="smaller">
            text2
           text3.
           </span><br/><span class="smaller">
           </span> <span class="smalleritalic">no</span> <span
class="smaller">problems.
           </span><br/><span class="smaller">


           </span><br/></p>

<p dir="ltr"><br/><span class="smaller">text1
           </span><br/><span class="smaller">
            <span class="reallytiny">text2 </span></span><br/><span
class="smaller">
           text3.
           </span><br/><span class="smaller">
           </span> <span class="smalleritalic">no</span> <span
class="smaller">problems.
           </span><br/><span class="smaller">


           </span><br/></p>

<p dir="ltr">  <span class="regular">"What else?"</span></p>
</doc>

=========8<-------------------

You might find it dissatisfying that the XML code doesn't look as
pretty-printed as your desired output. In order to arrive at an output as
neat as specified, you will need to apply three more passes of whitespace
extraction/normalization (left, right, middle) to the top-level spans. If
you really have to pretty-print the XML in such a way, I will send you the
complete stylesheet.

So here's the version that does just the splitting:

=========8<-------------------

<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:my="my"
 version="2.0"
 exclude-result-prefixes="my">

 <xsl:output method="xml" indent="no" />

 <!-- Default identity transform: -->
 <xsl:template match="@* | *">
   <xsl:copy>
     <xsl:apply-templates select="@* | node()"/>
   </xsl:copy>
 </xsl:template>

 <xsl:template match="p/span">
   <xsl:sequence select="my:split-at-br(.)"/>
 </xsl:template>


 <!-- split-at-br is intended for
         <p>foo<br/>bar</p>
      -> <p>foo</p><br/><p>bar</p> -->
 <xsl:function name="my:split-at-br" as="element(*)+">
   <xsl:param name="top" as="element(*)" />
   <!-- group adjacent leaves (text nodes, empty elements)
        which are not br: -->
   <xsl:for-each-group
     select="$top//node()[ count(node()) = 0 ]"
     group-adjacent="not(self::br)">
     <xsl:choose>
       <xsl:when test="current-grouping-key()">
         <!-- output the top element and its subtree, restricted to
              all ancestors of the current leaf group and the current
              leaf group itself: -->
         <xsl:apply-templates select="$top" mode="split">
           <xsl:with-param name="restricted-to"
                           select="current-group()" tunnel="yes"/>
         </xsl:apply-templates>
       </xsl:when>
       <xsl:otherwise>
         <br/>
       </xsl:otherwise>
     </xsl:choose>
   </xsl:for-each-group>
 </xsl:function>

 <xsl:template match="*" mode="split">
   <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
   <!-- Only process this element if it's within the restriction group
        or its members' ancestors: -->
   <xsl:if test="generate-id(.) = (
                   for $n in $restricted-to
                   return (
                     for $a in $n/ancestor-or-self::*
                     return generate-id($a)
                   )
                 )">
     <xsl:copy>
       <xsl:copy-of select="@*"/>
       <xsl:apply-templates mode="#current">
         <xsl:with-param name="restricted-to"
                         select="$restricted-to" tunnel="yes"/>
       </xsl:apply-templates>
     </xsl:copy>
   </xsl:if>
 </xsl:template>

 <xsl:template match="node()[count(node()) = 0]" mode="split">
   <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
   <xsl:if test="generate-id(.) = (
                   for $n in $restricted-to
                   return generate-id($n)
                 )">
     <xsl:copy-of select="." />
   </xsl:if>
 </xsl:template>

</xsl:transform>

=========8<-------------------

(Please note that I called it xsl:transform instead of xsl:stylesheet, as a
tribute to Roger L. Costello. But that's another thread, a dead thread.)

The stylesheet (or transformation program) does the following:

For each span immediately below a p, call a function that returns multiple
spans, interspersed with br's.

This function works as follows:

Of all descendants of the span, only select the leaves. So if the structure
is
p
 span(1)
   span(2)
     text(a)
     br
     text(b)
   span(3)
     text(c)
it selects the sequence (text(a), br, text(b), text(c)).
Then it groups adjacent nodes of this sequence by the key not(self::br):
each run of consecutive non-br nodes (key true) forms one group, and, as a
consequence, so does each run of br nodes (key false).
So we now have the following groups:
(text(a)) -- matches the grouping key
(br) -- doesn't match the grouping key
(text(b), text(c)) -- matches the grouping key
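
A minimal sketch to make this grouping visible; it is not part of the
stylesheet above. It reuses the same select and group-adjacent expressions
but, instead of splitting, merely reports each group. The input is an
assumption, namely the example structure written as
<p><span><span>a<br/>b</span><span>c</span></span></p>; the element names
groups, group and leaf and the attribute non-br are made up for this
illustration.

=========8<-------------------

<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 version="2.0">

 <xsl:output method="xml" indent="no" />

 <!-- For each span immediately below a p, list its leaf groups -->
 <xsl:template match="p/span">
   <groups>
     <!-- same leaf selection and grouping key as in my:split-at-br -->
     <xsl:for-each-group
       select=".//node()[ count(node()) = 0 ]"
       group-adjacent="not(self::br)">
       <!-- the boolean grouping key is written out as true or false -->
       <group non-br="{current-grouping-key()}">
         <xsl:for-each select="current-group()">
           <!-- text leaves show their text, other leaves their name -->
           <leaf>
             <xsl:value-of
               select="if (self::text()) then . else name()"/>
           </leaf>
         </xsl:for-each>
       </group>
     </xsl:for-each-group>
   </groups>
 </xsl:template>

</xsl:transform>

=========8<-------------------

For the assumed input this should report three groups: (a) with
non-br="true", (br) with non-br="false", and (b, c) with non-br="true".
Note that text(b) and text(c) end up in one group although they have
different parents, because adjacency refers to the selected leaf sequence,
not to the tree.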

For each of the non-br groups, span(1) -- the span to be split at br -- is
processed in mode="split", with the parameter $restricted-to set to the
current group.

So first, span(1) is processed in mode="split" with $restricted-to =
(text(a)).
An element is copied, and its contents processed, only if it is among the
ancestors of the $restricted-to nodes (or among these nodes themselves).
This holds for span(1), so its contents are processed in mode="split",
with the same $restricted-to parameter.
Being an ancestor of text(a), span(2) will be processed, while nothing
happens for span(3).
When the contents of span(2) are processed in mode="split" with
$restricted-to = (text(a)), only text(a) will be output; br and text(b)
are skipped because they are not in $restricted-to.

Going back to for-each-group: the next group is br which will be reproduced
as br, but on the same level as span(1).

So far, our result tree looks like
p
 span(1)
   span(2)
     text(a)
 br

The next group is (text(b), text(c)). Again, span(1) will be processed
in mode="split", now with $restricted-to = (text(b), text(c)).
As an ancestor of at least one of the $restricted-to leaf nodes, span(1)
will be reproduced (the element and its original attributes, not the entire
subtree!).
As ancestors to each of the leaf nodes, both span(2) and span(3) will be
reproduced below span(1).
When processing the subtree of span(2) with the restriction to (text(b),
text(c)), only text(b) will be output. For span(3), only text(c) will be
output.
So finally we have
p
 span(1)
   span(2)
     text(a)
 br
 span(1)
   span(2)
     text(b)
   span(3)
     text(c)

Although it may seem like overkill at first sight, the big advantage of this
approach is that it works well for br within nested spans.
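
As an aside, the two restriction tests can also be expressed with node
identity instead of generate-id() comparisons. The following is a sketch of
such a variant, not the stylesheet posted above: it uses the XPath 2.0
intersect operator, inlines the grouping into the p/span template instead of
calling a function, and relies on tunnel parameters being passed on
automatically. It is assumed, but not verified here against the original,
that it produces the same output for the sample document.

=========8<-------------------

<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 version="2.0">

 <xsl:output method="xml" indent="no" />

 <!-- Identity transform for everything outside p/span -->
 <xsl:template match="@* | node()">
   <xsl:copy>
     <xsl:apply-templates select="@* | node()"/>
   </xsl:copy>
 </xsl:template>

 <xsl:template match="p/span">
   <!-- remember the span: inside for-each-group the context item
        is the first leaf of the current group -->
   <xsl:variable name="top" select="."/>
   <xsl:for-each-group
     select=".//node()[ count(node()) = 0 ]"
     group-adjacent="not(self::br)">
     <xsl:choose>
       <xsl:when test="current-grouping-key()">
         <xsl:apply-templates select="$top" mode="split">
           <xsl:with-param name="restricted-to"
                           select="current-group()" tunnel="yes"/>
         </xsl:apply-templates>
       </xsl:when>
       <xsl:otherwise>
         <br/>
       </xsl:otherwise>
     </xsl:choose>
   </xsl:for-each-group>
 </xsl:template>

 <!-- Element with children: keep it only if it is an ancestor
      of some leaf in the current group -->
 <xsl:template match="*" mode="split">
   <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
   <xsl:if test="exists(. intersect
                        $restricted-to/ancestor-or-self::*)">
     <xsl:copy>
       <xsl:copy-of select="@*"/>
       <xsl:apply-templates mode="#current"/>
     </xsl:copy>
   </xsl:if>
 </xsl:template>

 <!-- Leaf: keep it only if it belongs to the current group -->
 <xsl:template match="node()[ count(node()) = 0 ]" mode="split">
   <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
   <xsl:if test="exists(. intersect $restricted-to)">
     <xsl:copy-of select="."/>
   </xsl:if>
 </xsl:template>

</xsl:transform>

=========8<-------------------

The leaf template wins over match="*" for empty elements because a pattern
with a predicate has a higher default priority (0.5) than a bare name test
(-0.5).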

With the generic approach (arbitrary XPath expressions for splitting), you
can extend analyze-string to process markup: in a preparatory pass, use
plain analyze-string on the text nodes to replace the regex with some unique
markup, then use the generic splitting function to split at this markup,
then treat the resulting nodes as you would have treated matching or
non-matching substrings.
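
The generic splitting function itself is not shown in this thread, so the
following sketch covers only the preparatory pass described above. The
marker element my:split-here and the default value of the regex parameter
are assumptions chosen for illustration; any markup that cannot occur in
the input would do.

=========8<-------------------

<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:my="my"
 version="2.0">

 <!-- Hypothetical splitting regex (a semicolon with surrounding
      whitespace). It must not match the empty string. -->
 <xsl:param name="regex" select="'\s*;\s*'"/>

 <!-- Copy elements and attributes unchanged -->
 <xsl:template match="@* | *">
   <xsl:copy>
     <xsl:apply-templates select="@* | node()"/>
   </xsl:copy>
 </xsl:template>

 <!-- Wrap every regex match in a marker element so that a later
      pass can split at my:split-here instead of at br -->
 <xsl:template match="text()">
   <xsl:analyze-string select="." regex="{$regex}">
     <xsl:matching-substring>
       <my:split-here><xsl:value-of select="."/></my:split-here>
     </xsl:matching-substring>
     <xsl:non-matching-substring>
       <xsl:value-of select="."/>
     </xsl:non-matching-substring>
   </xsl:analyze-string>
 </xsl:template>

</xsl:transform>

=========8<-------------------

Passing the regex as a parameter avoids having to double curly braces,
since the regex attribute is an attribute value template.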

-Gerrit


On 07.06.2010 13:36, Israel Viente wrote:


Thank you for your answer Mukul.
It does put the br between the spans, but it loses the spaces between spans
and replaces them with br.

The result of the code you sent gives the following output:

<p dir="ltr"><span class="smaller">text1</span><br /><span
class="smaller">text2 text3.</span><br /><span
class="smalleritalic">no</span><br /><span
class="smaller">problems.</span><br /><br /></p>

The desired one is:


<p dir="ltr"><span class="smaller">text1</span>
           <br />
            <span class="smaller">text2 text3.</span>
           <br />
           <span class="smalleritalic">no</span>  <span
class="smaller">problems.</span>
           <br />
           <br />
           </p>


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--

