xsl-list

Re: [xsl] mixed content, text-based abbreviations to xml

2009-03-06 03:53:46
Hi James,

Below are three transformation steps that get you to the final result. You can later combine them into one stylesheet using a micro-pipelining technique: put the templates of each step in its own mode, store each step's result in a variable, and apply templates in the next mode to the variable from the preceding step.

The first step wraps the content in parentheses in ex elements:

step1.xsl
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:template match="* | @* | comment() | processing-instruction()">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="\(.+?\)">
      <xsl:matching-substring>
        <ex><xsl:value-of select="translate(., '()', '')"/></ex>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>

giving as result

<?xml version="1.0" encoding="UTF-8"?><p>
    <lb n="1"/>In nomine Domini amen. Ne error obliuionis
    <supplied>geſtis</supplied> ſub tempore
    verſantibus pariat detrimentu<ex>m</ex>. <lb n="2"/>Conuenit, ut actus
    h<supplied>om</supplied>inu<ex>m</ex>
l<ex>itte</ex>r<supplied>ar</supplied><ex>um</ex> et teſtium fidedignorum
    <seg>annotac<ex>i</ex>on<ex>e</ex></seg> ad
    poſteritatis noticiam <foo>deducantur <seg>aut int<ex>er</ex>dum</seg>
        ob</foo> scripture vetustatem
    renovent<ex>ur</ex>. Ad perpetuam proinde ...
</p>

The second step marks with fragment the word part immediately before ex or supplied, and the word part immediately after ex:

step2.xsl

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:variable name="marks" select="'&#10;&#13;,. '"/>

  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()[following-sibling::*[1][self::ex or self::supplied] and
    not(translate(substring(., string-length(.)), $marks, '')='')]">
    <xsl:variable name="words" select="tokenize(., '\s')"/>
    <xsl:value-of select="substring(., 1, string-length(.)-string-length($words[last()]))"/>
    <fragment><xsl:value-of select="$words[last()]"/></fragment>
  </xsl:template>

  <xsl:template match="text()[preceding-sibling::*[1][self::ex] and
    not(translate(substring(.,1,1), $marks, '')='')]">
    <xsl:variable name="words" select="tokenize(., '\s')"/>
    <fragment><xsl:value-of select="$words[1]"/></fragment>
    <xsl:value-of select="substring(., string-length($words[1]) + 1)"/>
  </xsl:template>
</xsl:stylesheet>

giving as result

<?xml version="1.0" encoding="UTF-8"?><p>
    <lb n="1"/>In nomine Domini amen. Ne error obliuionis
    <supplied>geſtis</supplied> ſub tempore
verſantibus pariat <fragment>detrimentu</fragment><ex>m</ex>. <lb n="2"/>Conuenit, ut actus

<fragment>h</fragment><supplied>om</supplied><fragment>inu</fragment><ex>m</ex>

<fragment>l</fragment><ex>itte</ex><fragment>r</fragment><supplied>ar</supplied><ex>um</ex> et teſtium fidedignorum

<seg><fragment>annotac</fragment><ex>i</ex><fragment>on</fragment><ex>e</ex></seg> ad poſteritatis noticiam <foo>deducantur <seg>aut <fragment>int</fragment><ex>er</ex><fragment>dum</fragment></seg>
        ob</foo> scripture vetustatem
    <fragment>renovent</fragment><ex>ur</ex>. Ad perpetuam proinde ...
</p>
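
One caveat on step 2, offered as a hedged aside rather than a correction: a text node that sits between two markers and starts and ends with a word character (such as the "on" in annotac(i)on(e)) matches both text() templates. They have the same priority, so the processor may report the conflict or quietly use the last one. In the sample data every such text node is a single word, so the result above is still right. If a longer run such as "dum et l" could occur between two markers, only its first word would be wrapped. The sketch below, with the hypothetical file name step2-both.xsl, imports step2.xsl unchanged and adds one template that wraps both the first and the last word; it assumes step2.xsl sits in the same directory.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <!-- Reuse everything from step2.xsl, including the $marks variable.
       Templates declared here have higher import precedence than the
       imported ones, so no priority attribute is needed. -->
  <xsl:import href="step2.xsl"/>

  <!-- Text that directly follows ex AND directly precedes ex or supplied,
       beginning and ending with a non-mark character. -->
  <xsl:template match="text()[preceding-sibling::*[1][self::ex] and
    following-sibling::*[1][self::ex or self::supplied] and
    not(translate(substring(., 1, 1), $marks, '')='') and
    not(translate(substring(., string-length(.)), $marks, '')='')]">
    <xsl:variable name="words" select="tokenize(., '\s')"/>
    <xsl:choose>
      <!-- A single word: the whole text node is one fragment. -->
      <xsl:when test="count($words) = 1">
        <fragment><xsl:value-of select="."/></fragment>
      </xsl:when>
      <!-- Several words: wrap the first and the last, keep the middle as text. -->
      <xsl:otherwise>
        <xsl:variable name="first" select="$words[1]"/>
        <xsl:variable name="last" select="$words[last()]"/>
        <fragment><xsl:value-of select="$first"/></fragment>
        <xsl:value-of select="substring(., string-length($first) + 1,
          string-length(.) - string-length($first) - string-length($last))"/>
        <fragment><xsl:value-of select="$last"/></fragment>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

Run this in place of step2.xsl; for the sample input it should produce the same output as shown above, since the single-word branch mirrors what step 2 already does for "r" and "on".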

The final step groups the adjacent fragment, supplied and ex nodes and outputs the choice:

step3.xsl
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:template match="p|seg">
    <xsl:call-template name="process"/>
  </xsl:template>

  <xsl:template name="process">
    <xsl:copy>
      <xsl:for-each-group select="node()" group-adjacent="name() = ('fragment','supplied','ex')">
        <xsl:choose>
          <xsl:when test="current-grouping-key() and current-group()/name() = 'ex'">
            <choice>
              <xsl:if test="current-group()/name() = 'supplied'">
                <orig><xsl:apply-templates select="current-group()" mode="orig"/></orig>
              </xsl:if>
              <abbr><xsl:apply-templates select="current-group()" mode="abbr"/></abbr>
              <expan><xsl:apply-templates select="current-group()" mode="expan"/></expan>
            </choice>
          </xsl:when>
          <xsl:otherwise>
            <xsl:apply-templates select="current-group()" mode="text"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="seg" mode="text">
    <xsl:call-template name="process"/>
  </xsl:template>
  <xsl:template match="fragment" mode="text">
    <xsl:value-of select="."/>
  </xsl:template>
  <xsl:template match="node() | @*" mode="text">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" mode="text"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="ex" mode="orig">
    <am/>
  </xsl:template>
  <xsl:template match="fragment" mode="orig">
    <xsl:value-of select="."/>
  </xsl:template>
  <xsl:template match="supplied" mode="orig">
    <damage/>
  </xsl:template>

  <xsl:template match="ex" mode="abbr">
    <am/>
  </xsl:template>
  <xsl:template match="fragment" mode="abbr">
    <xsl:value-of select="."/>
  </xsl:template>
  <xsl:template match="supplied" mode="abbr">
    <xsl:copy-of select="."/>
  </xsl:template>

  <xsl:template match="fragment" mode="expan">
    <xsl:value-of select="."/>
  </xsl:template>
  <xsl:template match="supplied|ex" mode="expan">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>

giving the result you expect

<?xml version="1.0" encoding="UTF-8"?><p>
    <lb n="1"/>In nomine Domini amen. Ne error obliuionis
    <supplied>geſtis</supplied> ſub tempore
verſantibus pariat <choice><abbr>detrimentu<am/></abbr><expan>detrimentu<ex>m</ex></expan></choice>. <lb n="2"/>Conuenit, ut actus

<choice><orig>h<damage/>inu<am/></orig><abbr>h<supplied>om</supplied>inu<am/></abbr><expan>h<supplied>om</supplied>inu<ex>m</ex></expan></choice>

<choice><orig>l<am/>r<damage/><am/></orig><abbr>l<am/>r<supplied>ar</supplied><am/></abbr><expan>l<ex>itte</ex>r<supplied>ar</supplied><ex>um</ex></expan></choice> et teſtium fidedignorum

<seg><choice><abbr>annotac<am/>on<am/></abbr><expan>annotac<ex>i</ex>on<ex>e</ex></expan></choice></seg> ad poſteritatis noticiam <foo>deducantur <seg>aut <choice><abbr>int<am/>dum</abbr><expan>int<ex>er</ex>dum</expan></choice></seg>
        ob</foo> scripture vetustatem

<choice><abbr>renovent<am/></abbr><expan>renovent<ex>ur</ex></expan></choice>. Ad perpetuam proinde ...
</p>
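
The micro-pipelining mentioned at the top could be wired up roughly as follows. This is only a skeleton under stated assumptions: the mode names step1, step2 and step3 and the variable names pass1 and pass2 are made up for illustration, and the identity template is a placeholder standing in for the real templates of the three steps.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <!-- Entry point: run the three passes in sequence, each one on the
       temporary tree produced by the previous pass. -->
  <xsl:template match="/">
    <xsl:variable name="pass1">
      <xsl:apply-templates select="node()" mode="step1"/>
    </xsl:variable>
    <xsl:variable name="pass2">
      <xsl:apply-templates select="$pass1/node()" mode="step2"/>
    </xsl:variable>
    <xsl:apply-templates select="$pass2/node()" mode="step3"/>
  </xsl:template>

  <!-- Placeholder: an identity copy in each mode. Replace it with the
       templates from step1.xsl, step2.xsl and step3.xsl, adding the
       matching mode attribute to each of them. -->
  <xsl:template match="node() | @*" mode="step1 step2 step3">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" mode="#current"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

When moving the real templates in, the apply-templates calls in steps 1 and 2 that carry no mode need mode="#current" (or the explicit mode name) so that processing stays inside the pass. In step 3 only the p|seg template moves into the step3 mode; its text, orig, abbr and expan modes can stay as they are. The $marks variable from step 2 remains a top-level variable.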

Best Regards,
George
--
George Cristian Bina
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com

James Cummings wrote:
[resending after bounce message...because the mailing list doesn't
like google app's different X-MAIL-FROM header...fingers crossed it is
right now.]

Hiya,

I have some XML that has mixed content of markup and text nodes where
I want to process certain words.  The words in the document are
not already tokenized in any way (and multiple levels of nested markup
ranging from the middle of words makes this difficult).  What I want
to do is process the individual words (some containing or embedded in
markup) and where there is an expansion denoted by parentheses provide
that and the abbreviated form, and if that works, then if there is a
<supplied> element beginning and ending inside the word, replace that
with <damage/> to provide a copy of the original.

If the content is something like:

=====
<p>
   <lb n="1"/>In nomine Domini amen. Ne error obliuionis
<supplied>geſtis</supplied> ſub tempore
   verſantibus pariat detrimentu(m). <lb n="2"/>Conuenit, ut actus
h<supplied>om</supplied>inu(m)
       l(itte)r<supplied>ar</supplied>(um) et teſtium fidedignorum
<seg>annotac(i)on(e)</seg> ad
   poſteritatis noticiam <foo>deducantur <seg>aut int(er)dum</seg>
ob</foo> scripture vetustatem
   renovent(ur). Ad perpetuam proinde ...
</p>
=====

The output should change words containing ( and ) into a nested
structure such as:

input: h<supplied type="damage">om</supplied>inu(m)
output:
<choice>
   <orig>h<damage/>inu<am/></orig>
   <abbr>h<supplied type="damage">om</supplied>inu<am/></abbr>
   <expan>h<supplied type="damage">om</supplied>inu<ex>m</ex></expan>
</choice>

The <orig> is only supplied here because the original word actually
has a <supplied reason="damage"> element that begins/ends inside the
word. (For the full example I've not included the attribute to make it
more readable.)  Words can contain any number of elements such as
<lb/> and <supplied>, as well as the usual whitespace problems.
Abbreviations denoted by parentheses are always only part of an
individual word, though may occur multiple times in a word.

Full output of the above would be something like:
=====
<p>
   <lb n="1"/>In nomine Domini amen. Ne error obliuionis
<supplied>geſtis</supplied> ſub tempore
   verſantibus pariat <choice>
       <abbr>detrimentu<am/></abbr>
       <expan>detrimentu<ex>m</ex></expan>
   </choice>. <lb n="2"/>Conuenit, ut actus <choice>
       <orig>h<damage/>inu<am/></orig>
       <abbr>h<supplied>om</supplied>inu<am/></abbr>
       <expan>h<supplied>om</supplied>inu<ex>m</ex></expan>
   </choice>
   <choice>
       <orig>l<am/>r<damage/><am/></orig>
       <abbr>l<am/>r<supplied>ar</supplied><am/></abbr>
       <expan>l<ex>itte</ex>r<supplied>ar</supplied><ex>um</ex></expan>
   </choice> et teſtium fidedignorum <seg>
       <choice>
           <abbr>annotac<am/>on<am/></abbr>
           <expan>annotac<ex>i</ex>on<ex>e</ex></expan>
       </choice>
   </seg> ad poſteritatis noticiam <foo>deducantur <seg>aut <choice>
               <abbr>int<am/>dum</abbr>
               <expan>int<ex>er</ex>dum</expan>
           </choice>
       </seg> ob</foo> scripture vetustatem <choice>
       <abbr>renovent<am/></abbr>
       <expan>renovent<ex>ur</ex></expan>
       </choice>. Ad perpetuam proinde ...
</p>
=====

The default copying-to-output, choices between things and creating the
different versions of things once I have each word and its
abbreviations tokenized all seems straightforward.  It is getting each
word, without losing any other markup, and knowing where the
abbreviations are that I'm more fuzzy about.  I hate asking for help
before I've got very far, but it is straying into territory I'm not
very familiar with.  I'm guessing that this needs a multi-pass
mode-based stylesheet with xsl:analyze-string to find the parentheses,
but not tokenize() to find the edges of the word but, erm, maybe
xsl:for-each-group? While I found individual bits of this in the FAQ I
didn't find anything doing it all at once.

Any suggestions ranging from pointers in the right direction to
fully-realized solutions gratefully received with promises of a pint
next time you're in Oxford. ;-)

Many thanks,
-James Cummings
(posting from a new and silly domain name)

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--

