xsl-list
[Top] [All Lists]

RE: [xsl] Calculating groups of repeating elements

2008-12-11 14:56:32
Given that you have 250 places and 75 words, it seems to me that one could
approach this as follows:

(1) Construct all pairs of words that appear together in at least one place,
and find all the places in which this pair occurs. Discard the groups that
appear in only one place. This gives you all the groups of size 2.

(2) For each group of size 2, and the set of places in which this group
appears, construct candidate groups of size 3 by combining the 2-group with
all the other words appearing in those places. Again, discard the 3-groups
that appear in only one place. This gives you all the 3-groups

(3) Continue recursively to construct the 4-groups, 5-groups, etc

(4) Now add a step (0) before step (1) which constructs the 1-groups; the
expansion to 2-groups uses exactly the same logic.

Now finding the 1-groups looks like this:

<xsl:function name="f:level-one-groups" as="element(group)*">
<for-each-group select="//word" group-by=".">
  <xsl:if test="count(current-group()) gt 1">
    <group>
      <word><xsl:value-of select="current-grouping-key()"/></word>
      <xsl:copy-of select="place_number"/>
    </group>
  </xsl:if>
</xsl:for-each-group> 
</xsl:function>

and finding the level-n groups given the level-n-1-groups is:

<xsl:function name="f:level-n-groups" as="element(group)*">
  <xsl:param name="level-n-1-groups" as="element(group)*"/>
      <xsl:variable name="g" select="$level-n-1-groups"/>
      <xsl:for-each select="$g">
        <xsl:variable name="places" 
                      select="//place[place-number = $g/place_number]"/>
        <xsl:variable name="otherWords" 
                      select="$places/words/word[not(. = $g/word)]"/>
        <xsl:variable name="n-gram" 
                      select="$g/word, ."/>
        <xsl:variable name="places-with-n-gram" 
                      select="$places[every $w in n-gram satisfies
./words/word = $w]"/>
        <xsl:if test="count($places-with-n-gram) gt 1">
          <group>
            <xsl:for-each select="$n-gram">
               <word><xsl:value-of select="."/>
            </xsl:for-each>
            <xsl:for-each select="$places-with-n-gram">
               <word><xsl:value-of select="."/>
            </xsl:for-each>
          </group>
        </xsl:if>
     </xsl:for-each>
</xsl:function>

then you just have to do a recursion in which you start by finding the
level-1 groups, then recurse to levels 2, 3, 4 etc until you find there are
no groups at a particular level.

Not tested, of course, and almost certainly capable of improvement.

Michael Kay
http://www.saxonica.com/
          

-----Original Message-----
From: Quinn Dombrowski [mailto:qdombrow(_at_)uiuc(_dot_)edu] 
Sent: 10 December 2008 20:16
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: [xsl] Calculating groups of repeating elements

Hello,

I'm trying to calculate all of the groups of 2+ elements (in 
the sample data below, words) that appear together in more 
than one place. Ideally, I'd like to be able to sort 
descending both by length of group (5-word group, 4-word 
groups, etc), and by number of places the groups occur (100 
places, 99 places, etc.) I also need to be able to list the 
place numbers where they occur.

I started doing it manually this way but the number of 
possible combinations quickly became too big a task:

<xsl:template match="/">
<xsl:value-of 
select="count(atlas/place/place_number[../words/word='Aa']
intersect atlas/place/place_number[../words/word='C'])"/>
</template>
(adding more "intersects" as necessary, and getting rid of 
the "count" 
to see the place numbers)

Here's a sample of the data. Almost every word appears in 
multiple places, but each appears only once in the index, 
which I've used in other applications for matching to avoid 
re-calculating stats for the word over and over. Any help 
would be wonderful!

<atlas>
<place>
<place_number>1</place_number>
<words>
<word>Aa</word>
<word>C</word>
<word>Qqq</word>
</words>
</place>

<place>
<place_number>2</place_number>
<words>
<word>Aa</word>
<word>Bbbb</word>
<word>C</word>
<word>W</word>
<word>Zz</word>
</words>
</place>

<place>
<place_number>3</place_number>
<words>
<word>Aa</word>
<word>C</word>
<word>Bb</word>
<word>Qqq</word>
<word>Wwww</word>
<word>Zz</word>
</words>
</place>

[etc]

<index>

<index_entry>
<underlying_word>A</underlying_word>
<word>A</word>
<word>Aa</word>
<word>Aaa</word>
</index_entry>

<index_entry>
<underlying_word>B</underlying_word>
<word>Bb</word>
<word>Bbbb</word>
</index_entry>

[etc]

</index>
</atlas>

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--



--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--