xsl-list
[Top] [All Lists]

[xsl] optimization of complex XPath

2010-11-18 19:21:10
A bit of background before I get to the question:

I have about 2 GiB of XML which contains links.  These links are
constructed as <link area="value" cite="numvalue"/>, and reference num
elements with matching cite values, <num cite="numvalue"/>.  (The area
value is associated with an ancestor of the num element.)  The guarantee
is that an area and cite pair is unique across the entire content set.

The roughly-a-third of the content represented by that 2 GiB needs to be 
checked for broken links. (The other 2/3 needs to be converted from SGML, and 
_then_ it will need to be checked for broken links.)

The currently converted content is all in the area "decisions", I can
only check links that point _to_ the area "decisions"; the internal
links of the already-converted content.  That makes I have about 100,000
out of the 250,000 links to check on this pass, but I will eventually
have to validate all the links in the whole (5+ GiB) content set.

I am currently using the XPath expression:

for $x in (//link[(_at_)area='decisions']/@cite) return
    (if ($x = (//num/@cite))
     then 'good'
     else concat('bad cite value ',$x,'&#x000a;')
    )

to check the links. (I don't need to qualify the num element in
//num/@cite with [ancestor::*[1]/@area='decisions'] because all the
currently converted content has an @area value of "decisions".)

This works, and dumps out out a text list of bad links (and one "good"
per good link, so I can count them) but it is slow in ways that make me
think it's at least O(N^2) and probably worse.  Performance is
reasonable on small subsets of the content but very slow on the whole of
the available XML.  Since I have to check the whole thing as a unit to
properly test the ~two fifths of the links that refer to the decisions
area, and since I will eventually have to run the check on two or three
times as much content, O(N^2) is a problem; I can't say it takes a week
to run the link check.  (Well, I can _say_ it, but it won't meet with a
pleased reception.)

How can I make this check faster?

I can run this as an XQuery via eXist, via XPath 2.0 using Saxon 9.2, or
(if really necessary) some other means, and would appreciate any
suggestions from the list's collective wisdom.

Thanks!
Graydon

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--