I haven't studied it in close detail, but I strongly suspect that the initial
processing of the input files is streamed, but at some stage in the processing
pipeline everything ends up in memory.
Martin's solution uses arrays, and array processing in Saxon is generally not
pipelined in the way that sequence processing (normally) is. For example,
operations such as filtering and mapping on sequences are generally pipelined
(whether or not the input is streamed), while the equivalent operations on
arrays will materialise the array in memory.
For example, if you do (child::*[@x]/node-name() = $Q), then whether or not
the child nodes are held in memory or streamed, Saxon will not build the
intermediate sequence child::*[@x] in memory; it will effectively do something
like:

for each $c in child::*
  if ($c/@x and node-name($c) = $Q) then return true
return false
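As an analogy only (this is not Saxon's actual implementation, and the data shapes are hypothetical), the pipelined evaluation resembles short-circuiting over a lazy generator in Python: each child is filtered and mapped one at a time, and the intermediate sequence of names is never built as a whole.

```python
# Analogy for evaluating (child::*[@x]/node-name() = $Q) lazily.
# 'children' stands in for the child axis; the dict shape is invented
# for illustration.

def has_matching_child(children, q):
    # Generator expression: filter (has an "x" attribute) and map
    # (take the name) are fused; any() stops at the first match, so
    # no intermediate list of names is ever materialized.
    return any(c["name"] == q for c in children if "x" in c["attrs"])

children = [
    {"name": "a", "attrs": {}},
    {"name": "b", "attrs": {"x": "1"}},
    {"name": "c", "attrs": {"x": "2"}},
]
print(has_matching_child(children, "b"))  # True
```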
There's no equivalent of this for array processing right now. A construct like
[child::*/node-name()]?1 = X will materialise the array in memory, even if
child::* is streamed; and this doesn't count as a violation of streamability,
because we're not holding source nodes in memory, we're holding intermediate
computed results in memory.
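By the same analogy (again hypothetical data, not Saxon internals), the array case is like building a complete Python list before inspecting a single element: every name is computed and held in memory even though only one is needed.

```python
# Analogy for [child::*/node-name()]?1 = X: the array constructor
# forces the whole list of names into memory before ?1 is applied.
# (XPath's ?1 is 1-based; Python's [0] is the equivalent here.)

def first_name_equals(children, x):
    names = [c["name"] for c in children]  # materialises every name
    return names[0] == x                   # then inspects only one

children = [{"name": "a"}, {"name": "b"}, {"name": "c"}]
print(first_name_equals(children, "a"))  # True
```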
There's no intrinsic reason for not pipelining operations on arrays, other than
the lesson I learnt many years ago as an undergraduate computer science
student: when you're doing optimisation, focus your efforts on the constructs
that are encountered most frequently. Today everyone is using sequences, and
not many people are using arrays.
On 3 Feb 2020, at 20:39, Martin Honnen <martin.honnen@gmx.de> wrote:
On 03.02.2020 21:10, Vasu Chakkera <vasucv@gmail.com> wrote:
Thanks both. Martin's solution sort of worked, but it only gave me 21
children, whereas I had around 21,000 nodes in the XML. I am not sure to what
depth the comparison is happening.
It was solely an attempt to try to find some way to recursively process
two documents with streaming at the same time, not an attempt to
implement your particular algorithm.
I have now tested my code on a large file. It seems to process lots of
nodes, judging by the console output and the length of processing, but it
doesn't appear to use streaming when I look at the memory consumption
(600MB of input needed more than 2GB of memory), even though Saxon's -t
output nowhere reports that input trees were built.
Michael's comment on the way streaming is implemented in Saxon suggests
that the whole attempt is futile, even if the code somehow manages to
get past the streamability analysis.
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
or by email: xsl-list-unsub@lists.mulberrytech.com