xsl-list
[Top] [All Lists]

Re: [xsl] finding a minimum set of references

2013-01-09 20:40:15
Graydon,

I am glad this was useful.

Forgive me for not mentioning keys explicitly -- I can't imagine that
someone wouldn't use keys for this :)

Cheers,
Dimitre

On Wed, Jan 9, 2013 at 5:55 PM, Graydon <graydon(_at_)marost(_dot_)ca> wrote:
On Wed, Jan 09, 2013 at 04:56:44AM -0800, Dimitre Novatchev scripsit:
This can be done in three steps:

 1. Get the distinct values of concat(@area,'+',@site) for all tree elements.

 2. Get a sorted sequence of the corresponding `loc` elements that
were produced in step 1. The sort is by the count of `tree` elements
that have that particular loc element as descendant.

 3. Recursively add to the "minList" the top of the sorted loc
elements that isn't yet in this "minList". Stop condition -- when
there is nothing more to add.

This is very similar to what I did, after having woken up at the 3 AM
local going "wait! I can do this!" and making rapid notes.  (Having
learnt what happens should I _not_ make those notes!)

(Much thanks to Wolfgang for saying "use keys!" last night.)

What I did was to go through each tree element's loc element children, and
count how many other such loc elements existed anywhere in the data set. (Keys
were very useful for this, and just let me note that trying to stick the 
result
of a key lookup that returns a node in a comment via xsl:sequence during
debugging is something I should remember not to do more reliably.)

Once I had that count associated with each loc element (@other), I deleted all
the loc elements but the one with the highest value for @other.

Third step was to make a unique list of the @area/@cite pairs contained in
those surviving loc elements, and consider that to represent the minimum
spanning set of documents that contain all the element parent/child pattern
examples.

(There are other steps about "the example of this parent/child pattern is in
this document" and fetching those documents and so on, but those aren't in the
code sample below, which is already plenty long enough.)

I believe that this would really produce the minimum sequence of
locations -- due to the fact that we pick locations in decreasing
order of trees that reference them.

I think what I've got, which I suspect is equivalent in a graph-theoretic
sense, also does this.  I am also really really glad my current clients aren't
going to ask me to _prove_ that.

The input set this was tested on had 19,503 documents in it; the minimum
spanning set of examples that results has 294 documents in it, so about 1.5% 
of
the total.  If that's not actually minimal, it's close enough.  Total,
end-to-end run time (via Saxon 9.4.something in oXygen), from "make a
collection out of this half-GB pile of documents" to "minimum spanning set of
documents in a pile" is about 65 seconds. (The bit below, run independently on
the big set of tree elements, takes about a second.)

I'm really pleased with this result.  (Some of the previous attempts were
taking more than half an hour of run time to not work....)

I would be willing to produce the code that implements this algorithm,
but I would need to be provided with a complete (but not too long)
source XML document -- so that the result could be easily verified

That's extraordinarily kind of you, but I really did just need the algorithm
hint.  (This time, at least! :)

I'm greatly reassured at the similarity of the suggested methodology, too!

Thanks, Dimitre!

-- Graydon


<xsl:stylesheet exclude-result-prefixes="xs xd" version="2.0" 
xmlns:xd="http://www.oxygenxml.com/ns/doc/xsl";
  xmlns:xs="http://www.w3.org/2001/XMLSchema"; 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform";>
  <xd:doc scope="stylesheet">
    <xd:desc>
      <xd:p><xd:b>Created on:</xd:b> Jan 8, 2013</xd:p>
      <xd:p><xd:b>Author:</xd:b> graydon</xd:p>
      <xd:p/>
    </xd:desc>
  </xd:doc>
  <xsl:key match="//loc" name="locKey" use="concat(@area,'|',@cite)"/>
  <xsl:template match="/">
    <xsl:variable name="getLocCounts">
      <xsl:apply-templates mode="countLoc" select="node()"/>
    </xsl:variable>
    <xsl:variable name="findMostOther">
      <xsl:apply-templates mode="mostOther" select="$getLocCounts"/>
    </xsl:variable>
    <xsl:variable name="makeDocumentList">
      <xsl:apply-templates mode="docList" select="$findMostOther"/>
    </xsl:variable>
    <xsl:sequence select="$makeDocumentList"/>
  </xsl:template>
  <xsl:template match="node()|@*" mode="countLoc mostOther docList prettify">
    <xsl:copy>
      <xsl:apply-templates mode="#current" select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="loc" mode="countLoc">
    <xsl:copy>
      <xsl:attribute name="other" 
select="count(key('locKey',concat(@area,'|',@cite))[not(. is current())])"/>
      <xsl:apply-templates mode="countLoc" select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="locations" mode="mostOther">
    <xsl:copy>
      <xsl:for-each-group group-by="@other" select="loc">
        <xsl:sort data-type="number" order="descending" select="@other"/>
        <xsl:if test="position() eq 1">
          <xsl:sequence select="current-group()[1]"/>
        </xsl:if>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="container" mode="docList">
    <wkna-shared-cms>
      <assembly area="testpub" xml:lang="en">
        <xsl:for-each-group group-by="concat(@area,'|',@cite)" 
select="tree/locations/loc">
          <xsl:sort select="current-grouping-key()"/>
          <include area="{current-group()[1]/@area}" 
cite="{current-group()[1]/@cite}"/>
        </xsl:for-each-group>
      </assembly>
    </wkna-shared-cms>
  </xsl:template>
</xsl:stylesheet>

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--




-- 
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
-------------------------------------
Never fight an inanimate object
-------------------------------------
To avoid situations in which you might make mistakes may be the
biggest mistake of all
------------------------------------
Quality means doing it right when no one is looking.
-------------------------------------
You've achieved success in your field when you don't know whether what
you're doing is work or play
-------------------------------------
Facts do not cease to exist because they are ignored.
-------------------------------------
Typing monkeys will write all Shakespeare's works in 200yrs.Will they
write all patents, too? :)
-------------------------------------
I finally figured out the only reason to be alive is to enjoy it.

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--