Re: [xsl] Tree Comparing Algorithm

2020-02-03 21:44:06
 If the hash of a folder from both the trees are same, the entire tree
under the folder that matches the hash is ignored

Just a minor note that has nothing to do with XSLT:

Two objects having the same hash code is not sufficient for them to be
considered "equal". Different hash codes do mean that the two objects are
not "equal", but equal hash codes do not automatically mean that the
objects are "equal", because distinct objects can collide.

Even if the probability that two different objects have the same hash code
is low, we should take it into account. For example, we may choose to
calculate a pair of hash codes for each object, using two independent
hashing algorithms.
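
A minimal sketch of that idea, in Python since this has nothing to do with
XSLT. The choice of SHA-256 and BLAKE2b and the names hash_pair and
probably_equal are only assumptions for illustration:

```python
import hashlib

def hash_pair(data: bytes) -> tuple:
    """Return two digests of the same bytes from two unrelated algorithms."""
    return (
        hashlib.sha256(data).hexdigest(),
        hashlib.blake2b(data).hexdigest(),
    )

def probably_equal(a: bytes, b: bytes) -> bool:
    # Different pairs prove the inputs differ. Equal pairs only make
    # equality very likely; a byte-for-byte comparison remains the
    # only definitive check.
    return hash_pair(a) == hash_pair(b)
```

A collision would now have to occur in both algorithms at once for two
different objects to be mistaken for equal ones.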

Cheers,
Dimitre

On Mon, Feb 3, 2020 at 5:46 AM Vasu Chakkera vasucv(_at_)gmail(_dot_)com <
xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

Hi All,
I am planning to write an XSLT that compares XML trees using streaming.
The XML trees look something like this:

<root path="" mhash="">
  <folder path="" mhash="">
    <folder path="" mhash="">
      <leaf path="" mhash="">
      </leaf>
    </folder>
  </folder>
</root>

There will be two such XML files to compare. They are generated before and
after moving a folder from source to destination. Source and destination
could be on two different operating systems.
This is essentially the serialized Merkle tree of a folder structure. The
idea is to run a Merkle tree comparator that picks out the nodes that do
not match. The rules are as follows.

   1. If the root node hash matches in both trees, then there is no
   difference anywhere in the tree (because of how the Merkle tree is
   generated).
   2. If the root node hash does not match, we go to the child container
   and compare its hash in both XML files. (The XML folder structure will
   be identical with respect to the hash, but the folder paths may differ
   because of the Linux and Windows path conventions. Otherwise the folder
   structure is meant to be the same.)
   3. If the hash of a folder is the same in both trees, the entire tree
   under that folder is ignored.
   4. If the hash of a folder is not the same in both trees, the tree is
   traversed further and step 3 is repeated.
   5. The XSLT keeps writing out the nodes whose hashes do not match
   between the source and target XML files.
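
To make the rules concrete, here is a rough non-streaming sketch in Python
rather than XSLT. It assumes that the children of corresponding nodes appear
in the same order in both files, so they can be paired by position. The
function name diff_trees and the target-path attribute are made up for this
illustration:

```python
import xml.etree.ElementTree as ET

def diff_trees(src, dst):
    """Return a tree of the nodes whose mhash differs, or None if none do."""
    if src.get("mhash") == dst.get("mhash"):
        return None  # rules 1 and 3: equal hash, ignore the whole subtree
    # Rule 5: write out the non-matching node, keeping both paths because
    # the source and target path conventions may differ.
    node = ET.Element(src.tag, {
        "path": src.get("path", ""),
        "target-path": dst.get("path", ""),
    })
    # Rules 2 and 4: descend and compare the children pairwise.
    # Assumption: same child order in both trees; zip() stops at the
    # shorter list, so extra children on one side are not reported.
    for s_child, d_child in zip(src, dst):
        child = diff_trees(s_child, d_child)
        if child is not None:
            node.append(child)
    return node
```

Calling diff_trees on the two root elements (for example those returned by
ET.parse(...).getroot()) gives the comparator tree, or None when the root
hashes already match.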


So at the end of the processing, a comparator tree should be serialized
that contains the nodes that have a non-matching leaf node.
Looking at the serialized tree, we can determine which files got corrupted
during the transfer from source to target.



I am able to do this using non-streaming XSLT, but with streaming, since
we need to stream two trees at the same time and compare matching nodes, I
am not sure how to proceed.
I am able to do manipulations on one XML file with streaming. I tried a few
tricks, but did not get anywhere (I am not very comfortable copying my
code scribblings here).

I need streaming because the XML files may be big.
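
Outside XSLT, the lockstep idea could be sketched roughly as below. This is
Python, not a streaming stylesheet, and it assumes both files have exactly
the same element structure so that their parse events line up one for one.
The function name stream_mismatches is invented for the illustration:

```python
import xml.etree.ElementTree as ET

def stream_mismatches(source, target):
    """Yield (source_path, target_path) for each node whose mhash differs."""
    src_events = ET.iterparse(source, events=("start", "end"))
    dst_events = ET.iterparse(target, events=("start", "end"))
    skip_depth = 0  # > 0 while inside a subtree whose hashes already matched
    # Assumption: identical structure, so the two event streams stay aligned.
    for (event, s), (_, d) in zip(src_events, dst_events):
        if event == "start":
            if skip_depth:
                skip_depth += 1          # still inside an ignored subtree
            elif s.get("mhash") == d.get("mhash"):
                skip_depth = 1           # hashes match: ignore this subtree
            else:
                yield s.get("path"), d.get("path")
        else:
            if skip_depth:
                skip_depth -= 1
            # Discard the finished element's attributes and children
            # to limit memory use.
            s.clear()
            d.clear()
```

Note that a matching subtree is still parsed here; it is only ignored, not
skipped in the input.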
If someone has done something similar, or can point me to an intelligent
way to do this, I would be thankful.

Vasu



XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
EasyUnsubscribe <http://lists.mulberrytech.com/unsub/xsl-list/782854> (by
email <>)
