
Re: [xsl] Comparing documents: what of P is a subset of D?

2014-02-28 04:57:23
@Michael: your answer triggered a thought process that outlined the way to
a solution I'm able to implement. I don't know whether this is of any
interest to others, but it's a nice little exercise for a training course,
illustrating modes, keys, and a second input document.

Problem:
Given two XML files conforming to the same XML schema, find all leaf
nodes (text() and @*) in one document ("Patch") that have an identical
value at the same iXPath in the other document ("Data"), where an iXPath
is an XPath built from element and attribute names, with a predicate
[@_ix eq n] wherever @_ix occurs (in repeating elements).

Solution outline:
Process the Patch document, creating a set of nodes <p2v @path @value>
mapping iXPaths to values, with a key based on @path. Then, process
the Data document analogously, looking up iXPaths in the key and
comparing values, where found.

Below is the code, very likely not perfect ;-)

(Note that the output would be much more readable if an iXPath could
be truncated at a point where the subtree is identical in the defined
way.)

Thanks
W

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:wl="http://members.inode.at/w.laun"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text" />
<xsl:strip-space elements = '*' />

<xsl:param name="patchfile" as="xs:string"/>
<xsl:variable name="patch" select="document($patchfile)" />

<xsl:key name = "path2value" match = "p2v" use = "@path"/>

<!-- pass over patch file -->

<xsl:variable name="map" as="document-node()">
  <xsl:document>
    <map>
    <xsl:for-each select = "$patch">
      <xsl:apply-templates select = "*" mode="indexing">
        <xsl:with-param name = "path" select = "''" />
      </xsl:apply-templates>
    </xsl:for-each>
    </map>
  </xsl:document>
</xsl:variable>

<xsl:template match="*" mode="indexing">
  <xsl:param name = "path" as = "xs:string" />
  <xsl:apply-templates select = "*|@*|text()" mode="indexing">
    <xsl:with-param name = "path"
                    select = "concat( $path, '/', local-name() )"/>
  </xsl:apply-templates>
</xsl:template>

<xsl:template match="*[@_ix]" mode="indexing">
  <xsl:param name = "path" as = "xs:string" />
  <xsl:apply-templates select = "*|@*|text()" mode="indexing">
    <xsl:with-param name = "path"
                    select = "concat( $path, '/', local-name(),
                                      '[', @_ix, ']' )"/>
  </xsl:apply-templates>
</xsl:template>

<xsl:template match="@*" mode="indexing">
  <xsl:param name = "path" as = "xs:string" />
  <xsl:variable name = "fp" select = "concat( $path, '/', local-name() )"/>
  <p2v path = "{$fp}" value = "{.}"/>
</xsl:template>

<xsl:template match="@_ix" mode="indexing"/>

<xsl:template match="text()" mode="indexing">
  <xsl:param name = "path" as = "xs:string" />
  <p2v path = "{$path}" value = "{.}"/>
</xsl:template>

<!-- Pass over DB data file -->

<xsl:template match = "/">
  <xsl:apply-templates mode="comparing">
    <xsl:with-param name = "path" select = "''" />
  </xsl:apply-templates>
</xsl:template>

<xsl:template match="*" mode="comparing">
  <xsl:param name = "path" as = "xs:string" />
  <xsl:apply-templates select = "*|@*|text()" mode="comparing">
    <xsl:with-param name = "path"
                    select = "concat( $path, '/', local-name() )"/>
  </xsl:apply-templates>
</xsl:template>

<xsl:template match="*[@_ix]" mode="comparing">
  <xsl:param name = "path" as = "xs:string" />
  <xsl:apply-templates select = "*|@*|text()" mode="comparing">
    <xsl:with-param name = "path"
                    select = "concat( $path, '/', local-name(),
                                      '[', @_ix, ']' )"/>
  </xsl:apply-templates>
</xsl:template>

<xsl:template match="@*" mode="comparing">
  <xsl:param name = "path" as = "xs:string" />
  <xsl:variable name = "fp" select = "concat( $path, '/', local-name() )"/>
  <xsl:variable name = "pval"
                select = "key( 'path2value', $fp, $map/map )/@value"/>
  <xsl:if test = "$pval eq .">
    <xsl:value-of select = "concat( $fp, ' ... ', $pval)"/><xsl:text>
</xsl:text>
  </xsl:if>
</xsl:template>

<xsl:template match="@_ix" mode="comparing"/>

<xsl:template match="text()" mode="comparing">
  <xsl:param name = "path" as = "xs:string" />
  <xsl:variable name = "pval"
                select = "key( 'path2value', $path, $map/map )/@value"/>
  <xsl:if test = "$pval eq .">
    <xsl:value-of select = "concat( $path, ' ... ', $pval)"/><xsl:text>
</xsl:text>
  </xsl:if>
</xsl:template>

</xsl:stylesheet>
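
For the sample P document quoted further down in this thread, the indexing pass should leave something like the following in $map. This is a hand-derived sketch, not captured output, and the order of the attribute-derived entries within one element is implementation-dependent.

```xml
<map>
  <p2v path="/cca/rela[1]/fa" value="10"/>
  <p2v path="/cca/rela[1]/fc[1]/fc_fa" value="Y1"/>
  <p2v path="/cca/rela[1]/fc[1]/fc_fb" value="99"/>
  <p2v path="/cca/rela[5]/fa" value="50"/>
  <p2v path="/cca/rela[5]/fb" value="51"/>
  <p2v path="/cca/rela[5]/fc[1]/fc_fb" value="51"/>
  <p2v path="/cca/rela[5]/fc[1]/fc_fc" value="123"/>
  <p2v path="/cca/rela[5]/fc[2]/fc_fa" value="A2"/>
  <p2v path="/cca/rela[5]/fc[2]/fc_fb" value="52"/>
  <p2v path="/cca/rela[5]/fc[2]/fc_fc" value="456"/>
</map>
```

The @_ix attributes produce no entry of their own; they only appear as the [n] steps inside the paths, which is why the stylesheet's output uses brackets where the hand-written expected output uses parentheses.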



On 27/02/2014, Michael Kay <mike(_at_)saxonica(_dot_)com> wrote:
I'm not sure I've completely understood your "equality" relation that
underpins the intersection. Perhaps it's based on equality of the function

string-join(ancestor-or-self::*/@_ix, '|')

let's call this function $f, and we can use this as a parameter to the rest
of the solution.

we then need to do

doc('d.xml')//fc[some $e in doc('p.xml')//fc satisfies $f($e) eq $f(.)] ! path(.)

where path(.) is a function you can write to display the path to the
selected fc element.

The only remaining problem is that this is O(n*m) where n and m are the
sizes of D and P. For a more efficient solution, define a key on P.XML that
indexes each element on the value of the function $f, and replace the
predicate by a call on key().

The above uses XPath 3.0, but it can probably be expressed in XPath 2.0
easily enough at the cost of hard-coding the equality function.

Michael Kay
Saxonica
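
The key-based variant suggested above could be sketched in XSLT 2.0 roughly as follows. This is only an illustration under some assumptions: the key name by-ix and the variable $p are made up here, the equality function is hard-coded as the key value, the input is assumed to be untyped (no schema-aware processing), and the inline path construction is a stand-in for the path(.) function mentioned above. Like $f, the key looks only at the chain of @_ix values, not at element names.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<!-- the equality function $f, hard-coded as the key value -->
<xsl:key name="by-ix" match="fc"
         use="string-join(ancestor-or-self::*/@_ix, '|')"/>

<!-- p.xml as in the expression above; resolved relative to the stylesheet -->
<xsl:variable name="p" select="doc('p.xml')"/>

<!-- run with d.xml as the source document -->
<xsl:template match="/">
  <!-- keep each fc in D whose @_ix chain is also indexed in P -->
  <xsl:for-each select="//fc[key('by-ix',
                             string-join(ancestor-or-self::*/@_ix, '|'),
                             $p)]">
    <!-- stand-in for path(.): /name or /name(ix) for each step -->
    <xsl:value-of separator=""
        select="for $e in ancestor-or-self::* return
                concat('/', local-name($e),
                       if ($e/@_ix) then concat('(', $e/@_ix, ')') else '')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:for-each>
</xsl:template>

</xsl:stylesheet>
```

The third argument of key() restricts the lookup to the P document, so each fc in D costs one index lookup instead of a scan over P.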


On 27 Feb 2014, at 10:25, Wolfgang Laun 
<wolfgang(_dot_)laun(_at_)gmail(_dot_)com> wrote:

<cca><!-- a D XML -->
 <rela _ix='0' fa='0' fb='1'>
    <fc _ix='1' fc_fa='X1' fc_fb='1'/>
    <fc _ix='2' fc_fa='X2' fc_fb='2'/>
 </rela>
 <rela _ix='1' fa='10' fb='11'>
    <fc _ix='1' fc_fa='Y1' fc_fb='11'/>
    <fc _ix='2' fc_fa='Y2' fc_fb='12'/>
 </rela>
 <rela _ix='5' fa='50' fb='51'>
    <fc _ix='1' fc_fa='A1' fc_fb='51'/>
    <fc _ix='2' fc_fa='A2' fc_fb='52'/>
 </rela>
 <relb>...</relb>
 <relc>...</relc>
</cca>

<cca><!-- a P XML -->
 <rela _ix='1' fa='10'>
    <fc _ix='1' fc_fa='Y1' fc_fb='99'/>
 </rela>
 <rela _ix='5' fa='50' fb='51'>
    <fc _ix='1'                 fc_fb='51' fc_fc='123'/>
    <fc _ix='2' fc_fa='A2' fc_fb='52' fc_fc='456'/>
 </rela>
</cca>

Expected output:

/cca/rela(1)/fa   10
/cca/rela(1)/fc(1)/fc_fa   Y1
/cca/rela(5)/fa   50
/cca/rela(5)/fb   51
/cca/rela(5)/fc(1)/fc_fb   51
/cca/rela(5)/fc(2)/fc_fa   A2
/cca/rela(5)/fc(2)/fc_fb   52

Note that parentheses enclose values of @_ix.

-W

On 27/02/2014, Michael Kay <mike(_at_)saxonica(_dot_)com> wrote:
It would be easier to understand the problem with some example data.

Michael Kay
Saxonica

On 27 Feb 2014, at 08:05, Wolfgang Laun 
<wolfgang(_dot_)laun(_at_)gmail(_dot_)com> wrote:

The data model for a set of similarly (but not identically) built XML
documents is: a collection of arrays of records, which may contain
(recursively) arrays, records and scalars. (The terms "array" and
"record" are used in their "classic" meaning as, e.g., in Pascal.)
Document structures are fairly stable, but they do change over time.
Array elements are identified (indexed) by @_ix, not by position.
Record fields can be elements or attributes (when they are scalar).
Order is undefined, since XPaths plus @_ix values pinpoint each node.

One XML document D contains a full population for such a data set
(O(1MB)). A second XML document P contains "patches", i.e., each node
appearing in P is expected to be in D as well.

If S(P) is the sequence of nodes (annotated with their XPaths) in P
and S(D) the one with nodes from D, how can I determine S(P) intersect
S(D) (except all @_ix, whose values are bound to be identical)? Of
course, I don't want the common set of *data items* - I want the XML
paths of those common data items.

A solution (in XSLT 2.0) should not need individual adaptation for each
kind of data set.

I'm confident that I can create text files for D and P containing one
line <path> <value> for each node and run diff (after sort).

Any better ideas?

Cheers
Wolfgang
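
The text-file fallback mentioned above (one "path value" line per node, then sort and compare) might start from a generic stylesheet along these lines. It is a sketch only: the output layout with the @_ix value in parentheses and three spaces before the value is an assumption chosen for illustration, and nothing in it depends on a particular data set.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>
<xsl:strip-space elements="*"/>

<xsl:template match="/">
  <!-- every attribute except @_ix, and every remaining text node -->
  <xsl:for-each select="//@*[local-name() ne '_ix'] | //text()">
    <!-- element steps, with the @_ix value in parentheses where present -->
    <xsl:value-of separator=""
        select="for $e in ancestor::* return
                concat('/', local-name($e),
                       if ($e/@_ix) then concat('(', $e/@_ix, ')') else '')"/>
    <!-- attributes contribute one more step; text nodes do not -->
    <xsl:if test=". instance of attribute()">
      <xsl:value-of select="concat('/', local-name())"/>
    </xsl:if>
    <xsl:text>   </xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>&#10;</xsl:text>
  </xsl:for-each>
</xsl:template>

</xsl:stylesheet>
```

Run once over D and once over P, the two listings can be sorted and compared; keeping only the lines common to both (for instance with comm -12 rather than diff) would give the intersection.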

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


