
[xsl] efficient traversal of combined collections in XSLT 3.0

2012-11-24 07:53:19
So I have about 4.0 GB of "production" content: XML that's already in use, 
that can have deliverables generated from it, and that various groups of 
editors may change.

I also have "content": a batch (generally about 0.2 or 0.25 GB) that is being 
converted from SGML and that, before it is added to "production", needs to be 
checked to see if the links in it work.

Links use a combination of @area (the name of a scope within which numbers are 
unique) and @cite (the number); this is for legislation, so the numbers can get 
complicated, but the basic scheme is pretty simple.  (Targets are one direction 
in a bi-directional relationship, so a link in a fancy hat; they usually 
contain links, and we only need to check them if they _don't_ contain a link.)
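
To make the scheme concrete, here is a minimal sketch on made-up data; the 
act and section element names and the "statutes" area value are hypothetical 
stand-ins, not the real markup.  It assumes an XSLT 2.0 or later processor, run 
from the named template "main" (for Saxon, the -it:main option), and it only 
shows how an @area and a @cite combine into a single lookup value.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- index any element that has a num child with @cite, under "area|cite" -->
  <xsl:key name="places" match="*[num[@cite]]"
    use="concat(ancestor-or-self::*[@area][1]/@area, '|', num[1]/@cite)"/>
  <!-- hypothetical sample data, held in a temporary tree -->
  <xsl:variable name="sample">
    <act area="statutes">
      <section>
        <num cite="12.1">12.1</num>
      </section>
    </act>
  </xsl:variable>
  <xsl:template name="main">
    <result>
      <!-- an area and number that exist in the sample -->
      <hit>
        <xsl:value-of select="exists(key('places', 'statutes|12.1', $sample))"/>
      </hit>
      <!-- same area, but a number that does not exist -->
      <miss>
        <xsl:value-of select="exists(key('places', 'statutes|99', $sample))"/>
      </miss>
    </result>
  </xsl:template>
</xsl:stylesheet>
```

The first lookup should find the section and the second should find nothing, 
since the number only has to be unique within its own area.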

The slightly tricky bit is that I want to check the links in "content" to see 
if they match something anywhere in "content" _and_ "production" combined; XSLT 
3.0's version of key() will accept an arbitrary top-node, so (using the Saxon 
9.4 that ships with the current oXygen, 14.1) I can declare the stylesheet to 
be version 3.0, combine "production" and "content" into "searchSpace", and use 
a key on that.

<xsl:stylesheet exclude-result-prefixes="xs" version="3.0"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- newly converted documents whose links need checking -->
  <xsl:variable name="content"
    select="collection('file:///home/graydon/stages/APFF?recurse=yes;select=*.xml')"/>
  <!-- documents already in production -->
  <xsl:variable name="production"
    select="collection('file:///home/graydon/stages/production/2012-11-13?recurse=yes;select=*.xml;on-error=ignore')"/>
  <!-- everything a link is allowed to resolve against -->
  <xsl:variable name="searchSpace" select="($content,$production)"/>
  <!-- index anything with a cited number under "area|cite" -->
  <xsl:key match="*[num[@cite]]" name="places"
    use="concat(ancestor-or-self::*[@area][1]/@area,'|',num[1]/@cite)"/>
  <xsl:template match="/">
    <bucket>
      <xsl:for-each
        select="$content//link,$content//target[not(reference-text/link)]">
        <xsl:choose>
          <xsl:when
            test="key('places',concat(current()/@area,'|',current()/@cite),$searchSpace)">
            <good>
              <uri>
                <xsl:sequence select="base-uri(.)"/>
              </uri>
              <xsl:sequence select="."/>
            </good>
          </xsl:when>
          <xsl:otherwise>
            <bad>
              <uri>
                <xsl:sequence select="base-uri(.)"/>
              </uri>
              <xsl:sequence select="."/>
            </bad>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each>
    </bucket>
  </xsl:template>
</xsl:stylesheet>

This works well on content-sized chunks of input (.25 GB or so) and I get an 
answer in about 15 seconds.

It doesn't work on the full data set; 16 GB of RAM isn't enough to do this to 4 
GB of data.  Various wheels are in motion to get more RAM.

So maybe everything will be fine, but I can't help looking at that code and 
going "this is a really naive search; there has to be a more efficient way to 
do this."

So, O XSLT List, what's the more efficient way to do this?

Thanks!

-- Graydon

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--