I'm trying to process a very large (600 MB) flat XML document, a
bibliography where each of the 400,000 entries is completely independent
of the others. According to the Saxon web site and mailing list, it will
take approximately 5-10 times that (3-6 GB) to hold the document tree in
memory, which is impractical. The Saxon mailing list also has some tips
about how to accomplish this, but my question is: why doesn't XSLT
provide a way to specify that a matched node can be processed
independently of its predecessor and successor siblings? Alternatively,
couldn't an XSLT processor infer that from the complete absence of XPath
expressions that refer to predecessor and successor siblings?
I think the reason that XSLT vendors have not tried this approach is:

(a) there are rather few stylesheets where the technique works and can
be seen statically to work. It's not enough that all path expressions
select downwards: there must be no absolute path expressions, no global
variables that select from the initial context node, no keys, and
probably quite a few other conditions besides.

(b) for such stylesheets, a completely different run-time approach is
needed: effectively, a different XSLT processor.
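To illustrate the point, here is a hypothetical stylesheet fragment of my own (not from the original post): any one of the three constructs below reaches outside the current entry, so the processor could not discard preceding and following siblings, even though the template's match pattern itself is purely local.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Global variable selecting from the initial context node: -->
  <xsl:variable name="all" select="/bibliography/entry"/>
  <!-- Key built over the whole document: -->
  <xsl:key name="by-author" match="entry" use="author"/>
  <xsl:template match="entry">
    <!-- Absolute path expression inside a template: -->
    <xsl:value-of select="/bibliography/@date"/>
  </xsl:template>
</xsl:stylesheet>
```

A static analysis would have to prove the absence of every such construct before processing entries independently.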
I think that, in practice, if you want to do serial transformation, a
functional language is not the right answer: if you can only look at
each piece of input data once, then you need the ability to remember
what you have seen, so you need a procedural language with updatable
memory. That's why STX was invented.
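As a minimal sketch of this streaming-with-state style (illustrative Python SAX code, not STX; the element names are hypothetical), note how the handler carries mutable state across events instead of holding a tree:

```python
# Streaming event handler with updatable memory: each element is seen
# once, and anything worth remembering is stored in mutable fields.
import xml.sax

class EntryCounter(xml.sax.ContentHandler):
    def __init__(self):
        self.count = 0          # state accumulated across events
        self.titles = []
        self.in_title = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "entry":
            self.count += 1
        elif name == "title":
            self.in_title = True
            self._buf = []

    def characters(self, content):
        if self.in_title:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buf))
            self.in_title = False

handler = EntryCounter()
xml.sax.parseString(
    b"<bib><entry><title>A</title></entry>"
    b"<entry><title>B</title></entry></bib>",
    handler,
)
print(handler.count)   # prints 2 -- remembered in mutable state
```

Because the parser never builds a document tree, memory use is independent of document size.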
However, I think there is scope for someone to package up the idea of
running an XSLT transform on each "record" in a large file, and then
recombining the results.
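A rough sketch of that packaging idea, using the Python standard library's `iterparse` (my example; the `transform` function stands in for invoking an XSLT processor on a single record):

```python
# Split a large flat document into records, transform each record
# independently, and recombine the results into one output stream.
import io
import xml.etree.ElementTree as ET

def transform(entry):
    # Placeholder for running an XSLT transform on one record;
    # here we just extract the title text.
    return entry.findtext("title", default="")

def process(source, sink):
    sink.write("<titles>")
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "entry":
            sink.write("<t>%s</t>" % transform(elem))
            elem.clear()   # free the subtree: memory stays bounded
    sink.write("</titles>")

src = io.StringIO(
    "<bib><entry><title>A</title></entry>"
    "<entry><title>B</title></entry></bib>"
)
out = io.StringIO()
process(src, out)
print(out.getvalue())   # prints <titles><t>A</t><t>B</t></titles>
```

Calling `elem.clear()` after each record is what keeps the peak memory proportional to one entry rather than the whole 600 MB document.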
Michael Kay
http://www.saxonica.com/