So I have about 4.0 GB of "production" content: XML that's already in use, that
can have deliverables generated from it, and that various groups of editors may
change.
I also have "content": material (generally about 0.2 to 0.25 GB) that is being
converted from SGML and which, before it is added to "production", needs to be
checked to see if the links in it work.
Links use a combination of @area (the name of a scope within which numbers are
unique) and @cite (the number); this is for legislation, so the numbers can get
complicated, but the basic scheme is pretty simple. (Targets are one direction
in a bi-directional relationship, so a link in a fancy hat; they usually contain
links, and we only need to check them if they _don't_ contain a link.)
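To make the scheme concrete, here is a made-up fragment. Only link, target,
num, reference-text, @area and @cite come from the real data; "act", "section",
"p" and all the values are invented for illustration. The link would resolve
to the second section and the target to the first.

<!-- hypothetical sample; act, section, p and all values are invented -->
<act area="statutes-2012">
  <section>
    <num cite="12.3">12.3</num>
    <p>See <link area="statutes-2012" cite="14.1">section 14.1</link>.</p>
  </section>
  <section>
    <num cite="14.1">14.1</num>
    <!-- no link inside reference-text, so this target gets checked -->
    <target area="statutes-2012" cite="12.3">
      <reference-text>referred to in section 12.3</reference-text>
    </target>
  </section>
</act>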
The slightly tricky bit is that I want to check the links in "content" to see
if they match something in "content" _and_ in "production"; XSLT 3.0's version
of key() will accept an arbitrary top-node, so (using the Saxon 9.4 that ships
with the current oXygen, 14.1) I can declare the stylesheet to be version 3.0,
combine "production" and "content" into "searchSpace", and define a key on that:
<xsl:stylesheet exclude-result-prefixes="xs" version="3.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- newly converted material whose links need checking -->
  <xsl:variable name="content"
      select="collection('file:///home/graydon/stages/APFF?recurse=yes;select=*.xml')"/>
  <!-- everything already in production -->
  <xsl:variable name="production"
      select="collection('file:///home/graydon/stages/production/2012-11-13?recurse=yes;select=*.xml;on-error=ignore')"/>
  <!-- a link may point into either set -->
  <xsl:variable name="searchSpace" select="($content,$production)"/>

  <!-- index every element that has a cited number, under "area|cite" -->
  <xsl:key match="*[num[@cite]]" name="places"
      use="concat(ancestor-or-self::*[@area][1]/@area,'|',num[1]/@cite)"/>

  <xsl:template match="/">
    <bucket>
      <!-- every link, plus any target that does not itself contain a link -->
      <xsl:for-each
          select="$content//link,$content//target[not(reference-text/link)]">
        <xsl:choose>
          <xsl:when
              test="key('places',concat(current()/@area,'|',current()/@cite),$searchSpace)">
            <good>
              <uri>
                <xsl:sequence select="base-uri(.)"/>
              </uri>
              <xsl:sequence select="."/>
            </good>
          </xsl:when>
          <xsl:otherwise>
            <bad>
              <uri>
                <xsl:sequence select="base-uri(.)"/>
              </uri>
              <xsl:sequence select="."/>
            </bad>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each>
    </bucket>
  </xsl:template>
</xsl:stylesheet>
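One hedged aside: as far as I read the spec, the third argument of key() is
typed as a single node, so a stricter processor might reject a sequence of
documents there. A sketch of an equivalent test, assuming the same "places" key
and the same variables, applies the two-argument key() once per document in
$searchSpace instead:

<!-- drop-in alternative for the xsl:when start tag; a sketch, not tested
     against the real data -->
<xsl:when
    test="$searchSpace/key('places',
            concat(current()/@area,'|',current()/@cite))">

Here current() still refers to the link or target being checked, while the
path step makes each document in turn the context for the key lookup. This
changes only how the lookup is expressed; the same documents are still loaded.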
The stylesheet above works well on content-sized chunks of input (0.25 GB or
so), and I get an answer in about 15 seconds.
It doesn't work on the full data set; 16 GB of RAM isn't enough to do this to 4
GB of data. Various wheels are in motion to get more RAM.
So maybe everything will be fine, but I can't help looking at that code and
going "this is a really naive search; there has to be a more efficient way to
do this."
So, O XSLT List, what's the more efficient way to do this?
Thanks!
-- Graydon