Fw: [xsl] Question on duplicate node elimination

2010-08-24 06:54:25

Hello,

I finally got duplicate elimination to work and successfully used
dupelim.xsl [1] with the xsltproc and DataPower XSLT 1.0 processors.

First an idcopy of the input document is generated:
  Every node (text, PI, comment, element, attribute, namespace) becomes a
  <node> element whose id attribute holds the generate-id() value of the
  corresponding node in the original document. The structure of the
  original document is preserved in the idcopy.

The generate-id() values from the input document allow for a duplicate
node elimination step. Stylesheet dupelim.xsl [1] demonstrates this.


Some questions are still open -- any help is appreciated.
In the duplicate elimination step the key() function returns (ghost)
nodes, and I have no idea where they come from.
Using the Kaysian intersection method it is possible to filter out the
interesting nodes, but that adds overhead.
How can building this intersection be avoided?
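
For readers who have not met it, the Kaysian method computes the
intersection of two node-sets $ns1 and $ns2 with the XPath 1.0 expression
$ns1[count(. | $ns2) = count($ns2)]. The stylesheet below is only a minimal
sketch of that idiom; the variable names and the two example selections are
illustrative assumptions and are not taken from dupelim.xsl.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Illustrative node-sets; names and selections are hypothetical. -->
    <xsl:variable name="ns1" select="/a/b[1]/c"/>
    <xsl:variable name="ns2" select="//c[. &gt; 1]"/>

    <!-- A node of $ns1 also belongs to $ns2 exactly when adding it
         to $ns2 does not increase the size of $ns2. -->
    <xsl:variable name="intersect"
                  select="$ns1[count(. | $ns2) = count($ns2)]"/>

    <xsl:text>count($intersect)=</xsl:text>
    <xsl:value-of select="count($intersect)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Run against simple2.xml, $ns1 holds the c elements with values 1 and 2, and
$ns2 those with values 2, 3 and 4, so the intersection should contain the
single c with value 2. The predicate is evaluated once per node of $ns1 and
each evaluation builds a union with $ns2, which is the overhead mentioned
above.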


Find below two testfiles I used.
The input in the stylesheet is selected as a subset of the //c nodes.
There are two demonstration scenarios in the stylesheet:
- parentStep followed by parentStep (commented out)
- ancestorStep (active, used in demo output below)

The questions from above show in the output below as:
  $cur=id2543928
  count($ks)=6
  count($intersect)=2
Why are there 6 nodes with key id2543928 although the intersection with
the real nodes consists of 2 nodes only?
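
One possible explanation, offered as an assumption because dupelim.xsl is not
reproduced in this message: if the key is declared with a pattern such as
match="node", it matches <node> elements at every depth of the copied
subtrees, not only the top-level copies. With the ancestor step each of the
4 c nodes contributes a copy of a and a copy of its parent b
(count($aux)=8). The id of the first b would then occur 4 times nested
inside the copies of a and 2 times at the top level, which would give 6 key
hits but only 2 in the intersection. The sketch below uses hypothetical key
names and a hand-written fragment to show the difference between a key that
matches at any depth and one restricted to top-level nodes.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:exsl="http://exslt.org/common"
  exclude-result-prefixes="exsl">

  <xsl:output method="text"/>

  <!-- Hypothetical key declarations; dupelim.xsl may differ. -->
  <xsl:key name="any-depth" match="node"  use="@id"/>
  <xsl:key name="top-level" match="/node" use="@id"/>

  <xsl:template match="/">
    <!-- Two copies of an "a" subtree plus one copy of its "b" child. -->
    <xsl:variable name="aux">
      <node id="A"><node id="B"/></node>
      <node id="A"><node id="B"/></node>
      <node id="B"/>
    </xsl:variable>

    <!-- key() searches the document of the context node, so step
         into the node-set before calling it. -->
    <xsl:for-each select="exsl:node-set($aux)/node[1]">
      <xsl:text>any depth: </xsl:text>
      <xsl:value-of select="count(key('any-depth', 'B'))"/>
      <xsl:text>&#10;top level: </xsl:text>
      <xsl:value-of select="count(key('top-level', 'B'))"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Under XSLT 1.0 key semantics the first count should be 3 (two nested copies
plus the top-level one) and the second 1. If this assumption about the key
pattern holds, restricting the pattern to children of the root would make
the separate intersection step unnecessary.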


$ cat simple2.xml
<a>
  <b>
    <c>1</c>
    <c>2</c>
  </b>
  <b>
    <c>3</c>
    <c>4</c>
  </b>
</a>

$ cat simple3.xml
<a>
  <b>
    <c>1</c>
    <d>
      <c>2</c>
    </d>
  </b>
  <b>
    <d>
      <c>3</c>
    </d>
    <d>
      <c>4</c>
    </d>
  </b>
</a>

$ xsltproc dupelim.xsl simple2.xml
count($nodes)=4
count($aux)=8
fst=id2543926
cnt=4
1: id2543926a
2: id2543926a
3: id2543926a
4: id2543926a
$cur=id2543926
count($ks)=4
count($intersect)=4
$cur=id2543928
count($ks)=6
count($intersect)=2
$cur=id2543926
count($ks)=4
count($intersect)=4
$cur=id2543928
count($ks)=6
count($intersect)=2
$cur=id2543926
count($ks)=4
count($intersect)=4
$cur=id2544540
count($ks)=6
count($intersect)=2
$cur=id2543926
count($ks)=4
count($intersect)=4
$cur=id2544540
count($ks)=6
count($intersect)=2
3
===============================================================================
<node xmlns:exslt="http://exslt.org/common" id="id2543926" type="element"
name="a"><node id="id2544161" type="namespace" name="xml"
value="http://www.w3.org/XML/1998/namespace"/><node id="id2543927"
type="text" value="&#10;  "/><node id="id2543928" type="element"
name="b"><node id="id2544161" type="namespace" name="xml"
value="http://www.w3.org/XML/1998/namespace"/><node id="id2543929"
type="text" value="&#10;    "/><node id="id2543930" type="element"
name="c"><node id="id2544161" type="namespace" name="xml"
value="http://www.w3.org/XML/1998/namespace"/><node id="id2544534"
type="text" value="1"/></node><node id="id2544535" type="text" value="&#10;
"/><node id="id2544536" type="element" name="c"><node id="id2544161"
type="namespace" name="xml" value="http://www.w3.org/XML/1998/namespace"/>
<node id="id2544537" type="text" value="2"/></node><node id="id2544538"
type="text" value="&#10;  "/></node><node id="id2544539" type="text"
value="&#10;  "/><node id="id2544540" type="element" name="b"><node
id="id2544161" type="namespace" name="xml"
value="http://www.w3.org/XML/1998/namespace"/><node id="id2544541"
type="text" value="&#10;    "/><node id="id2544542" type="element"
name="c"><node id="id2544161" type="namespace" name="xml"
value="http://www.w3.org/XML/1998/namespace"/><node id="id2544543"
type="text" value="3"/></node><node id="id2544544" type="text" value="&#10;
"/><node id="id2544545" type="element" name="c"><node id="id2544161"
type="namespace" name="xml" value="http://www.w3.org/XML/1998/namespace"/>
<node id="id2544546" type="text" value="4"/></node><node id="id2544548"
type="text" value="&#10;  "/></node><node id="id2544549" type="text"
value="&#10;"/></node><node xmlns:exslt="http://exslt.org/common"
id="id2543928" type="element" name="b"><node id="id2544161"
type="namespace" name="xml" value="http://www.w3.org/XML/1998/namespace"/>
<node id="id2543929" type="text" value="&#10;    "/><node id="id2543930"
type="element" name="c"><node id="id2544161" type="namespace" name="xml"
value="http://www.w3.org/XML/1998/namespace"/><node id="id2544534"
type="text" value="1"/></node><node id="id2544535" type="text" value="&#10;
"/><node id="id2544536" type="element" name="c"><node id="id2544161"
type="namespace" name="xml" value="http://www.w3.org/XML/1998/namespace"/>
<node id="id2544537" type="text" value="2"/></node><node id="id2544538"
type="text" value="&#10;  "/></node><node
xmlns:exslt="http://exslt.org/common" id="id2544540" type="element"
name="b"><node id="id2544161" type="namespace" name="xml"
value="http://www.w3.org/XML/1998/namespace"/><node id="id2544541"
type="text" value="&#10;    "/><node id="id2544542" type="element"
name="c"><node id="id2544161" type="namespace" name="xml"
value="http://www.w3.org/XML/1998/namespace"/><node id="id2544543"
type="text" value="3"/></node><node id="id2544544" type="text" value="&#10;
"/><node id="id2544545" type="element" name="c"><node id="id2544161"
type="namespace" name="xml" value="http://www.w3.org/XML/1998/namespace"/>
<node id="id2544546" type="text" value="4"/></node><node id="id2544548"
type="text" value="&#10;  "/></node>
===============================================================================

Result:
<a>
  <b>
    <c>1</c>
    <c>2</c>
  </b>
  <b>
    <c>3</c>
    <c>4</c>
  </b>
</a>
===============
<b>
    <c>1</c>
    <c>2</c>
  </b>
===============
<b>
    <c>3</c>
    <c>4</c>
  </b>
===============

$

The result is correct with XPath "//c/ancestor::*" for file simple2.xml.
You may want to play with the c-predicate's position() borders.


[1] http://stamm-wilbrandt.de/en/xsl-list/dupelim.xsl


Mit besten Gruessen / Best wishes,

Hermann Stamm-Wilbrandt
Developer, XML Compiler, L3
WebSphere DataPower SOA Appliances
----------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Martin Jetter
Management: Dirk Wittkopp
Registered office: Boeblingen
Court of registration: Amtsgericht Stuttgart, HRB 243294
----- Forwarded by Hermann Stamm-Wilbrandt/Germany/IBM on 08/24/2010 01:24 PM -----

From:       Hermann Stamm-Wilbrandt/Germany/IBM(_at_)IBMDE
To:         xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Date:       08/24/2010 01:40 AM
Subject:    Re: [xsl] Question on duplicate node elimination



Lars,

> Could you create the rtf using a "special" attribute that preserves the
> id of the node which you are copying? E.g.
>
>     <xsl:attribute name="originalID"
>                    namespace="http://hsw.org/specialNamespaceURI">
>       <xsl:value-of select="generate-id()" />
>     </xsl:attribute>
>
> Then you could use that originalID attribute to determine what nodes were
> identical in the original, and strip out the originalID attribute after
> using it.

That's cool, you had nearly the same idea as I did!

That is the reason I did an "idcopy" -- every node (text, PI, comment,
element, attribute, namespace) will become a <node> and its id attribute
holds the generate-id() value of the original document! (see below)


> But I guess this would only work on elements, since only elements can
> have attributes...

After idcopy every node of the original document is a <node> element and
can have attributes.
I have duplicate elimination working partially but am still facing some
problems. I will either post the solution or question(s) ...


  <!--
    Generate idcopy of current node.
    Every <node> has these attributes: id, type, value.
    Most <node>s have a name attribute.
  -->
  <xsl:template name="idcopy">
    <xsl:choose>
      <!-- true when the current node is not a namespace node -->
      <xsl:when test="count(. | ../namespace::*) !=
                      count(../namespace::*)">
        <xsl:apply-templates select="." mode="idcopy"/>
      </xsl:when>

      <xsl:otherwise>
        <node id="{generate-id()}" type="namespace" name="{name()}"
              value="{.}"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="@*" mode="idcopy">
    <node id="{generate-id()}" type="attribute" name="{name()}"
          value="{.}"/>
  </xsl:template>

  <!-- match="*" instead of match="node()": this template emits
       type="element", and node() has the same default priority as the
       text(), comment() and processing-instruction() templates below,
       which is a template conflict on a strict processor -->
  <xsl:template match="*" mode="idcopy">

    <node id="{generate-id()}" type="element" name="{name()}">

      <xsl:apply-templates select="@*" mode="idcopy"/>

      <xsl:for-each select="namespace::*">
        <node id="{generate-id()}" type="namespace" name="{name()}"
              value="{.}"/>
      </xsl:for-each>

      <xsl:apply-templates
        select="*|text()|comment()|processing-instruction()"
        mode="idcopy"/>
    </node>
  </xsl:template>

  <xsl:template match="comment()" mode="idcopy">
    <node id="{generate-id()}" type="comment" value="{.}"/>
  </xsl:template>

  <xsl:template match="processing-instruction()" mode="idcopy">
    <node id="{generate-id()}" type="processing-instruction"
          name="{name()}" value="{.}"/>
  </xsl:template>

  <xsl:template match="text()" mode="idcopy">
    <node id="{generate-id()}" type="text" value="{.}"/>
  </xsl:template>




Mit besten Gruessen / Best wishes,

Hermann Stamm-Wilbrandt



From:       Lars Huttar <lars_huttar(_at_)sil(_dot_)org>
To:         xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Cc:         Hermann Stamm-Wilbrandt/Germany/IBM(_at_)IBMDE
Date:       08/23/2010 10:54 PM
Subject:    Re: [xsl] Question on duplicate node elimination



 On 8/22/2010 5:12 PM, Hermann Stamm-Wilbrandt wrote:
>> I'm not sure what you find surprising about the results you are seeing.
>> What results would you expect?
> Not surprising.
>
> But how could the algorithm step of "duplicate elimination" be done?
> How can the duplicates be determined and removed, correctly?


If I'm understanding your question correctly (are you trying to
implement an XPath processor in XSLT 1.0?) I think it's impossible, if
you create the rtf simply using xsl:copy-of. Because as Mike said, once
you've copied nodes, the copies are distinct; there's no information in
the rtf(s) to distinguish copies of the same node from copies of
identical twins.

Could you create the rtf using a "special" attribute that preserves the
id of the node which you are copying? E.g.

    <xsl:attribute name="originalID"
                   namespace="http://hsw.org/specialNamespaceURI">
      <xsl:value-of select="generate-id()" />
    </xsl:attribute>

Then you could use that originalID attribute to determine what nodes were
identical in the original, and strip out the originalID attribute after
using it.

But I guess this would only work on elements, since only elements can have
attributes...
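
For element nodes, the suggestion above could be sketched roughly as follows.
This is only an illustration under stated assumptions: the prefix o, the
input shape /a/b/c (as in simple.xml) and the preceding-sibling comparison
used for the elimination step are choices made for this example, not part
of any posted stylesheet.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:exsl="http://exslt.org/common"
  xmlns:o="http://hsw.org/specialNamespaceURI"
  exclude-result-prefixes="exsl o">

  <xsl:output omit-xml-declaration="yes"/>

  <xsl:template match="/">
    <!-- Copy the parent of every c, tagging each copy with the
         generate-id() of the node it was copied from. -->
    <xsl:variable name="rtf">
      <xsl:for-each select="/a/b/c">
        <xsl:for-each select="..">
          <xsl:copy>
            <xsl:attribute name="o:originalID">
              <xsl:value-of select="generate-id()"/>
            </xsl:attribute>
            <xsl:copy-of select="@* | node()"/>
          </xsl:copy>
        </xsl:for-each>
      </xsl:for-each>
    </xsl:variable>

    <!-- Keep only the first copy that carries each originalID. -->
    <xsl:for-each select="exsl:node-set($rtf)/*
        [not(@o:originalID = preceding-sibling::*/@o:originalID)]">
      <xsl:copy>
        <!-- Strip the helper attribute again. -->
        <xsl:copy-of select="@*[not(local-name() = 'originalID' and
            namespace-uri() = 'http://hsw.org/specialNamespaceURI')]
            | node()"/>
      </xsl:copy>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

For simple.xml this should leave two b elements instead of four. The
preceding-sibling comparison is quadratic in the number of copies; a key on
the originalID attribute would be the usual alternative for larger inputs.
The emitted copies may still carry the helper namespace declaration, since
xsl:copy also copies the namespace nodes of the element.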

Lars



Perhaps I was not clear enough with my question.
How can this step (p. 40 from [1]) be implemented in XPath 1.0 plus
exslt:node-set():
A location step identifies a new node-set relative to the context
node-set.
The location step is evaluated against each node in the context node-set,
and the union of the resulting node-sets becomes the context node-set for
the next step. Location steps consist of an axis identifier, a node test
and zero or more predicates (see Figure 3-4). ...


[1] http://www.theserverside.net/tt/books/addisonwesley/EssentialXML/index.tss

Mit besten Gruessen / Best wishes,

Hermann Stamm-Wilbrandt



From:       Michael Kay <mike(_at_)saxonica(_dot_)com>
To:         xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Date:       08/22/2010 11:53 PM
Subject:    Re: [xsl] Question on duplicate node elimination



I'm not sure what you find surprising about the results you are seeing.
What results would you expect?

xsl:copy-of creates a new node. Copying the same node twice creates two
copies with distinct identity. Is that the issue?

Michael Kay
Saxonica

On 22/08/2010 22:25, Hermann Stamm-Wilbrandt wrote:
Hello,

I have a question on duplicate node elimination.

From the XPath 1.0 specification:
...
* node-set (an unordered collection of nodes without duplicates)
...
An initial sequence of steps is composed together with a following step as
follows. The initial sequence of steps selects a set of nodes relative to a
context node. Each node in that set is used as a context node for the
following step. The sets of nodes identified by that step are unioned
together. The set of nodes identified by the composition of the steps is
this union.
...

So "are unioned together" results in a node-set, and that does not contain
duplicates.
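
The difference between the native union and a set of copies can be made
visible with two counts. This is a minimal sketch, assuming an input shaped
like simple.xml and a processor that supports exsl:node-set(); it is not
part of the posted stylesheet.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:exsl="http://exslt.org/common"
  exclude-result-prefixes="exsl">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- One copy of the parent per c node: copies have no shared identity. -->
    <xsl:variable name="rtf">
      <xsl:for-each select="/a/b/c">
        <xsl:copy-of select=".."/>
      </xsl:for-each>
    </xsl:variable>

    <!-- The native step unions the parents and drops duplicates. -->
    <xsl:text>native step: </xsl:text>
    <xsl:value-of select="count(/a/b/c/..)"/>
    <xsl:text>&#10;copies: </xsl:text>
    <xsl:value-of select="count(exsl:node-set($rtf)/*)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

For simple.xml the native step should count 2 parents, while the result tree
fragment holds 4 copies, one per c node.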

Now how can this algorithm step be realized in XPath 1.0 plus the
exslt:node-set function?
(this would work in browsers with the technique from David Carlisle [1])


This is the output of the stylesheet simple.xsl below on file simple.xml.
For the four nodes /a/b/c, their parents are copied into an intermediate
result. xsltproc and xalan show that the four copies are different
according to their generate-id() values, whereas the first pair and the
last pair are representations of the same node.

xsltproc        xalan
1: id2659470    1: AbT0
2: id2659470    2: AbT0
3: id2659354    3: AbT1
4: id2659354    4: AbT1

1: id2659234    1: AbT2
2: id2659244    2: AbT3
3: id2659254    3: AbT4
4: id2659264    4: AbT5

1:<b>           1:<b>
     <c>1</c>         <c>1</c>
     <c>2</c>         <c>2</c>
   </b>             </b>
2:<b>           2:<b>
     <c>1</c>         <c>1</c>
     <c>2</c>         <c>2</c>
   </b>             </b>
3:<b>           3:<b>
     <c>1</c>         <c>1</c>
     <c>2</c>         <c>2</c>
   </b>             </b>
4:<b>           4:<b>
     <c>1</c>         <c>1</c>
     <c>2</c>         <c>2</c>
   </b>             </b>



$ cat simple.xml
<a>
   <b>
     <c>1</c>
     <c>2</c>
   </b>
   <b>
     <c>1</c>
     <c>2</c>
   </b>
</a>
$ cat simple.xsl
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:exsl="http://exslt.org/common">

   <xsl:output omit-xml-declaration="yes"/>

   <xsl:template match="/">
     <xsl:variable name="rtf">
       <xsl:for-each select="/a/b/c">
         <xsl:copy-of select=".."/>
       </xsl:for-each>
     </xsl:variable>

     <xsl:for-each select="/a/b/c">
       <xsl:value-of select="position()"/><xsl:text>:</xsl:text>
       <xsl:value-of select="generate-id(..)"/><xsl:text>&#10;</xsl:text>
     </xsl:for-each>

     <xsl:text>&#10;</xsl:text>

     <xsl:for-each select="exsl:node-set($rtf)/*">
       <xsl:value-of select="position()"/><xsl:text>:</xsl:text>
       <xsl:value-of select="generate-id(.)"/><xsl:text>&#10;</xsl:text>
     </xsl:for-each>

     <xsl:text>&#10;</xsl:text>

     <xsl:for-each select="exsl:node-set($rtf)/*">
       <xsl:value-of select="position()"/><xsl:text>:</xsl:text>
       <xsl:copy-of select="."/><xsl:text>&#10;</xsl:text>
     </xsl:for-each>
   </xsl:template>

</xsl:stylesheet>
$


[1] http://dpcarlisle.blogspot.com/2007/05/exslt-node-set-function.html


Mit besten Gruessen / Best wishes,

Hermann Stamm-Wilbrandt


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--



