xsl-list
[Top] [All Lists]

Re: [xsl] Tree Comparing Algorithm

2020-02-03 18:49:39
I haven't studied it in close detail, but I strongly suspect that the initial 
processing of the input files is streamed, but at some stage in the processing 
pipeline everything ends up in memory. 

Martin's solution uses arrays, and array processing in Saxon is generally not 
pipelined in the way that sequence processing (normally) is. For example, 
operations such as filtering and mapping on sequences are generally pipelined 
(whether or not the input is streamed), while the equivalent operations on 
arrays will materialise the array in memory.

For example, if you do (child::*[@x][1]/node-name() = $Q), then whether or not 
the child nodes are held in memory or streamed, Saxon will not build the 
intermediate sequence child::*[@x] in memory; it will effectively do something 
like

for each child::*
   if exists(@x)
      if (first)
          if (node-name() = $Q)
              return true
      else return false;

There's no equivalent of this for array processing right now. A construct like 
[child::*/node-name()]?1 = X will materialize the array in memory, even if 
child::* is streamed; and this doesn't count as a violation of streamability, 
because we're not holding source nodes in memory, we're holding intermediate 
computed results in memory.

There's no intrinsic reason for not pipelining operations on arrays, other than 
the lesson I learnt many years ago as an undergraduate computer science 
student: when you're doing optimisation, focus your efforts on the constructs 
that are encountered most frequently. Today everyone is using sequences, and 
not many people are using arrays.

Michael Kay
Saxonica

On 3 Feb 2020, at 20:39, Martin Honnen martin(_dot_)honnen(_at_)gmx(_dot_)de 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

On 03.02.2020 21:10, Vasu Chakkera vasucv(_at_)gmail(_dot_)com wrote:
Thanks both. Martin's solution sort of worked, but it only gave me 21
children, but I had around 21000 nodes in the xml. I am not sure to what
depth the comparison is happening.

It was solely an attempt to try to find some way to recursively process
two documents with streaming at the same time, not an attempt to
implement your particular algorithm.

I have tested my code now on a large files, it seems to process lots of
nodes judging by the output to the console and the length of processing,
but it doesn't seem to use streaming when I look at the memory
consumption (600MB  of input needed more than 2GB of memory), even if
Saxon nowhere shows any -t message that input trees were built.

Michael's comment on the way streaming is implemented in Saxon suggests
that the whole attempt is futile, even if the code somehow manages to
get by the streamability analysis.

--~----------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
EasyUnsubscribe: http://lists.mulberrytech.com/unsub/xsl-list/1167547
or by email: xsl-list-unsub(_at_)lists(_dot_)mulberrytech(_dot_)com
--~--
<Prev in Thread] Current Thread [Next in Thread>