xsl-list

Re: [xsl] Tree Comparing Algorithm

2020-02-03 11:15:46
The only facility in XSLT 3.0 that allows streaming of two input files "in 
parallel" is xsl:merge, and as Martin points out, that's rather specialised and 
not really suited to your requirements. 

In Saxon, streaming is in most cases done in push mode (where the parser owns 
the control loop, and sends events to the XSLT processor). You can't have two 
parallel control loops except with multi-threading, so the opportunities for 
streaming multiple files are limited (with xsl:merge, Saxon indeed uses 
multi-threading).

At first sight, I don't see an XSLT-based answer to this one.

Except, perhaps: you could do a streamed transformation of each input document 
into an XML representation of an event stream, like

<startElement name="folder" path="" hash=""/>
<startElement name="folder" path="" hash=""/>
<endElement name="folder"/>

etc

and then attempt to do an xsl:merge of the two event streams.
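
The event-stream idea can be illustrated outside XSLT. What follows is a
minimal Python sketch, not an XSLT or Saxon solution: it uses the standard
library's pull-style iterparse to turn each document into a stream of start
events and pairs the two streams positionally. It assumes both files list the
same elements in the same order and carry the path and mhash attributes from
the sample; the function names start_events and mismatches are made up for
this example.

```python
import io
import xml.etree.ElementTree as ET

def start_events(source):
    """Yield (name, path, mhash) for each start tag, reading incrementally."""
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Attributes are already available when the start event fires.
            yield elem.tag, elem.get("path"), elem.get("mhash")
        else:
            # Drop the finished element's content to limit memory use.
            elem.clear()

def mismatches(source_a, source_b):
    """Pair the two event streams positionally and yield differing nodes.

    Assumption: both documents have the same shape, so the n-th start
    event of one corresponds to the n-th start event of the other.
    """
    pairs = zip(start_events(source_a), start_events(source_b))
    for (name_a, path_a, hash_a), (_name_b, path_b, hash_b) in pairs:
        if hash_a != hash_b:
            yield name_a, path_a, path_b

# Usage with two tiny in-memory documents (hypothetical data).
before = io.BytesIO(b'<root path="/" mhash="r1"><leaf path="/x" mhash="a"/></root>')
after = io.BytesIO(b'<root path="/" mhash="r2"><leaf path="/x" mhash="b"/></root>')
for name, source_path, target_path in mismatches(before, after):
    print(name, source_path, target_path)
```

Because both parsers are pulled from a single loop, no second control loop or
extra thread is needed here; that is a property of this pull-style Python API
and does not carry over to push-mode streaming.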

Michael Kay
Saxonica

On 3 Feb 2020, at 13:47, Vasu Chakkera vasucv(_at_)gmail(_dot_)com 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

Hi All,
I am planning to write an XSLT that compares two XML trees using streaming.
The XML trees look something like this:
<root path="" mhash="">
  <folder path="" mhash="">
    <folder path="" mhash="">
      <leaf path="" mhash="">
      </leaf>
    </folder>
  </folder>
</root>
There will be two such XML files to compare. They are generated before and 
after moving a folder from source to destination; source and destination 
could be on two different operating systems.
Each file is essentially the serialized Merkle tree of a folder structure. 
The idea is to run a Merkle tree comparator that picks out the nodes that do 
not match. The rules are as follows:
1. If the root node hashes of the two trees match, there is no difference 
   anywhere in the tree (because of how the Merkle tree is generated).
2. If the root node hashes do not match, we go to each child container and 
   compare its hash in both XML files. (The folder structure is meant to be 
   identical in the two files; only the folder paths may differ, because of 
   the Linux and Windows path conventions.)
3. If the hash of a folder is the same in both trees, the entire subtree 
   under that folder is ignored.
4. If the hash of a folder is not the same in both trees, the subtree is 
   traversed further and step 3 is repeated.
5. The XSLT keeps writing out the nodes whose hashes do not match between 
   the source and target XML files.
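
The rules above can be sketched as a recursive comparison. This is a
non-streaming Python illustration of the logic only, not the XSLT in question:
it loads both trees into memory, assumes that children appear in the same
order in both files, and records the target path in a made-up targetPath
attribute. The function name diff is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

def diff(a, b):
    """Return a pruned copy of `a` with only the nodes whose mhash differs
    from the corresponding node in `b`, or None if the hashes are equal."""
    if a.get("mhash") == b.get("mhash"):
        # Equal hashes: by the Merkle property the whole subtree matches.
        return None
    # "targetPath" is an illustrative attribute name, not from the original.
    out = ET.Element(a.tag, {"path": a.get("path", ""),
                             "targetPath": b.get("path", "")})
    # Assumption: the n-th child of `a` corresponds to the n-th child of `b`.
    for child_a, child_b in zip(a, b):
        sub = diff(child_a, child_b)
        if sub is not None:
            out.append(sub)
    return out

# Usage with two small hypothetical trees.
source = ET.fromstring(
    '<root path="/" mhash="1">'
    '<folder path="/a" mhash="2"><leaf path="/a/x" mhash="3"/></folder>'
    '<folder path="/b" mhash="5"/></root>')
target = ET.fromstring(
    '<root path="/" mhash="9">'
    '<folder path="/a" mhash="8"><leaf path="/a/x" mhash="7"/></folder>'
    '<folder path="/b" mhash="5"/></root>')
result = diff(source, target)
if result is not None:
    print(ET.tostring(result, encoding="unicode"))
```
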

So at the end of the processing, a comparator tree should be serialized that 
contains the nodes that have a non-matching leaf node.
Looking at the serialized tree, we can determine which files got messed up 
during the transfer from source to target.



I am able to do this using non-streaming XSLT, but with streaming, since we 
need to stream two trees at a time and compare their nodes, I am not 
very sure how to proceed.
I am able to do manipulations on one XML file with streaming. I tried a few 
tricks, but did not get anywhere (I am not very comfortable copying my 
scribbled code here).

I need streaming because the XML files may be big.
If someone has done something similar, or can point me to an intelligent way 
to do this, I will be thankful.

Vasu



XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
EasyUnsubscribe <http://lists.mulberrytech.com/unsub/xsl-list/293509> (by 
email <>)