
[xsl] Combining consecutive siblings

2009-07-29 21:57:20
I'm trying to post-process the HTML produced via Adobe Acrobat's PDF export. (Actually, XHTML via Tidy from Acrobat's HTML 4.01.) Acrobat does something very funky with end-of-line hyphens that it deems "soft", namely wrapping the preceding and following text nodes inside a styled <span> and removing the hyphen. To simplify the situation, if the input text was

The volumes of the Docu-
mentary History of the Rati-
fication of the Constitution are heavy.

the output would be something like

<p>The volumes of the <i>Docu</i><i>mentary
History of the Rati</i><i>fication of the Constitution</i>
are heavy.</p>

Now there are various reasons why it would be nice to transform these constructs so that all consecutive <i> elements are wrapped in a single element. I've come up with the following XSLT 2.0 templates that rely on the '>>' operator to group consecutive sibling <i>'s for processing. It works on some sample data, but it is a risky transform because if the logic is not perfect, there could be dropped <i>'s. Can anyone see a potential case where this would fail?

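   <xsl:template match="i">
      <xsl:choose>
         <xsl:when test="preceding-sibling::node()[1][self::i]">
            <!-- omit: the first <i> of this run handles me in the next when-clause -->
         </xsl:when>
         <!-- first <i> of a run of two or more -->
         <xsl:when test="following-sibling::node()[1][self::i]">
            <!-- first following sibling that is not an <i>;
                 empty if the run extends to the end of the parent -->
            <xsl:variable name="stopNode"
               select="following-sibling::node()[not(self::i)][1]"/>
            <xsl:copy>
               <xsl:apply-templates/>
               <!-- append the content of every later <i> that is not after $stopNode -->
               <xsl:apply-templates
                  select="following-sibling::i[not(. &gt;&gt; $stopNode)]"
                  mode="copy"/>
            </xsl:copy>
         </xsl:when>
         <!-- an isolated <i> -->
         <xsl:otherwise>
            <xsl:copy><xsl:apply-templates/></xsl:copy>
         </xsl:otherwise>
      </xsl:choose>
   </xsl:template>

   <!-- mode "copy": emit only the content of an <i>, without the wrapper -->
   <xsl:template match="i" mode="copy">
      <xsl:apply-templates/>
   </xsl:template>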
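
For comparison, here is a rough sketch of how the same merge might be expressed with XSLT 2.0 grouping (xsl:for-each-group with group-adjacent) in place of the two match="i" templates above. It is only a sketch under some assumptions: the identity template stands in for however the rest of the document is handled, the pattern "i" is assumed to match the input elements as written (for XHTML input that would mean something like xpath-default-namespace is set), and the variable name "run" is just illustrative. Like the templates above, it drops any attributes on the <i> elements and treats a whitespace-only text node between two <i>'s as breaking the run.

   <xsl:stylesheet version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- Identity template: an assumption, standing in for the
           rest of the stylesheet -->
      <xsl:template match="@*|node()">
         <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
         </xsl:copy>
      </xsl:template>

      <!-- Any element with at least one <i> child: walk its children
           in runs of "is an <i>" / "is not an <i>" -->
      <xsl:template match="*[i]">
         <xsl:copy>
            <xsl:apply-templates select="@*"/>
            <xsl:for-each-group select="node()"
               group-adjacent="boolean(self::i)">
               <xsl:choose>
                  <!-- a run of one or more adjacent <i> elements -->
                  <xsl:when test="current-grouping-key()">
                     <xsl:variable name="run" select="current-group()"/>
                     <xsl:for-each select="$run[1]">
                        <!-- one wrapper holding the content of the whole run -->
                        <xsl:copy>
                           <xsl:apply-templates select="$run/node()"/>
                        </xsl:copy>
                     </xsl:for-each>
                  </xsl:when>
                  <!-- everything else passes through unchanged -->
                  <xsl:otherwise>
                     <xsl:apply-templates select="current-group()"/>
                  </xsl:otherwise>
               </xsl:choose>
            </xsl:for-each-group>
         </xsl:copy>
      </xsl:template>

   </xsl:stylesheet>

Because every child node lands in exactly one group, this form should not be able to drop an <i> silently. On the sample paragraph above, the three <i> elements would come out as a single <i> containing "Documentary History of the Ratification of the Constitution", with the line break inside the text preserved.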

DS

--
David Sewell, Editorial and Technical Manager
ROTUNDA, The University of Virginia Press
PO Box 801079, Charlottesville, VA 22904-4318 USA
Courier: 310 Old Ivy Way, Suite 302, Charlottesville VA 22903
Email: dsewell(_at_)virginia(_dot_)edu   Tel: +1 434 924 9973
Web: http://rotunda.upress.virginia.edu/

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--