xsl-list

RE: use XSLT or XQuery in Saxon?

2005-01-06 02:47:06

I have an extremely large (over 300 MB) XML file and tens
of thousands of small XML files generated by
applying various XSLT stylesheets to the one big XML file.

You're right, 300MB *is* large (I had someone recently ask how to process a
large file and it turned out to be 300KB). You have a choice between
spending money on lots of memory (say 2GB, but it depends on the actual
structure) and doing more development work to split the task up. This
applies equally whether you are using XSLT or XQuery: in Saxon these are
really just different surface syntaxes for the same processing engine.

I am using Saxon for XSLT and will be using it also
for XQuery.

Is XQuery or XSLT the better solution for this problem?
Query each text node in the big XML file and verify
that its content is present in one of the result XML
files.

Clearly this requires a better algorithm than searching all the small files
once for each text node in the large file.

One solution is to aggregate the small files into a single document and
index it using a key. This would require XSLT, because keys are not
available in XQuery. Some XQuery implementations might do an indexed join
automatically, but Saxon doesn't (yet). Of course, aggregating the small
files means even more memory.
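
The indexing idea itself can be illustrated outside XSLT. What follows is a minimal Python sketch, not Saxon's key mechanism: it assumes that "content" means the whitespace-trimmed value of each text node, and the file names and sample documents are hypothetical stand-ins for the real data. The index is built once from the small files, so each text node of the large file costs one lookup rather than a scan of every small file.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def text_values(root):
    """Yield each non-blank text value in a parsed document, whitespace-trimmed."""
    for elem in root.iter():
        for chunk in (elem.text, elem.tail):
            if chunk and chunk.strip():
                yield chunk.strip()

def build_index(small_docs):
    """Map each text value to the set of small-file names that contain it."""
    index = defaultdict(set)
    for name, xml_text in small_docs.items():
        for value in text_values(ET.fromstring(xml_text)):
            index[value].add(name)
    return index

# Hypothetical sample data standing in for the real files.
small_docs = {
    "a.xml": "<r><p>alpha</p></r>",
    "b.xml": "<r><p>beta</p><p>alpha</p></r>",
}
big_doc = "<doc><s><t>alpha</t><t>gamma</t></s><t>beta</t></doc>"

index = build_index(small_docs)
# For each text value in the big document, list the small files containing it.
report = {v: sorted(index.get(v, ())) for v in text_values(ET.fromstring(big_doc))}
```

An empty list in `report` marks a value that was not found in any small file. As with the key-based approach, the whole index lives in memory.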

Another solution, again dependent on XSLT, is to use grouping. This doesn't
require the small documents to be aggregated into a single document. If you
take the union of the text nodes in the large document and the values in the
small documents, and then do grouping, a group of size 1 indicates a value
that is present in one file and not the other.
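
The grouping logic can be sketched in Python as follows. This is an illustration of the idea only, not XSLT grouping, and the value lists are hypothetical. The "group of size 1" test presumes that each value occurs at most once on each side; to avoid relying on that, the sketch records which side each value came from and reports the groups that have a single origin.

```python
from collections import defaultdict

def group_by_value(big_values, small_values):
    """Group the union of both value lists by value, recording each origin."""
    groups = defaultdict(set)
    for v in big_values:
        groups[v].add("big")
    for v in small_values:
        groups[v].add("small")
    return groups

# Hypothetical values extracted from the large file and from the small files.
big_values = ["alpha", "beta", "gamma"]
small_values = ["alpha", "beta", "delta"]

groups = group_by_value(big_values, small_values)
# A value whose group has a single origin is present on one side only.
unmatched = {v: next(iter(src)) for v, src in groups.items() if len(src) == 1}
```

Here `unmatched` maps each one-sided value to the side it came from, so values missing from the small files are those mapped to `"big"`.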

However, if performance is really important (you don't actually say), I
think I would be inclined to write this "by hand" as a SAX application. It
will probably be an order of magnitude faster that way.
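
A hand-written streaming pass might look roughly like the sketch below, which uses Python's standard `xml.sax` module rather than the Java SAX API. It assumes the values from the small files have already been collected into a set (`known` here is hypothetical sample data), and the class name `TextChecker` is illustrative. The handler never builds a tree: it keeps only the stack of open element names, so memory use does not grow with the size of the large file.

```python
import xml.sax

class TextChecker(xml.sax.ContentHandler):
    """Stream a document, classifying each text node as found or missing."""

    def __init__(self, known_values):
        super().__init__()
        self.known = known_values
        self.stack = []    # names of the currently open elements
        self.buffer = []   # character chunks of the current text node
        self.found = []    # (element path, text) pairs present in known_values
        self.missing = []  # (element path, text) pairs not present

    def _flush(self):
        # SAX may deliver one text node in several chunks, so join them first.
        text = "".join(self.buffer).strip()
        self.buffer = []
        if text:
            path = "/" + "/".join(self.stack)
            target = self.found if text in self.known else self.missing
            target.append((path, text))

    def startElement(self, name, attrs):
        self._flush()
        self.stack.append(name)

    def endElement(self, name):
        self._flush()
        self.stack.pop()

    def characters(self, content):
        self.buffer.append(content)

# Hypothetical data: values gathered from the small files, and a tiny "big" file.
known = {"alpha", "beta"}
handler = TextChecker(known)
xml.sax.parseString(
    b"<doc><s><t>alpha</t><t>gamma</t></s><t>beta</t></doc>", handler
)
```

For a real file, `xml.sax.parse(path, handler)` would take the place of `parseString`. The recorded element path gives the position of each missing value in the large document.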

In the past it was taken for granted that to handle 300MB of data you needed
a database. I wouldn't rule this option out: it largely depends on where the
data comes from and what its lifecycle looks like. Databases are designed
specifically for this kind of job.

Michael Kay
http://www.saxonica.com/

Based on this information, generate a report
that shows which content is present and in which file
and, in a separate section, which content was not found
in the result XML files, along with its parent
element or other markup to indicate its position in
the big XML file.

All the small XML files are stored as flat files in
various directories on a Windows file system, although
most files are in one directory. The big XML file is
fairly complex, with multiple levels of nested
elements.

Any comments or suggestions?
Thank you


              

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--






