
Re: grouping and word counting

2003-07-19 09:56:04
Hi Marina,

One can use the string tokeniser from FXSL (the "str-split-to-words"
template) in order to obtain a list of words from a string and then count
them.

This, combined with the Muenchian method for grouping, gives us the following
solution.

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:ext="http://exslt.org/common"
 exclude-result-prefixes="ext">

 <xsl:import href="strSplit-to-Words.xsl"/>

  <xsl:output method="text"/>

  <xsl:key name="kMsg" match="MESSAGE" use="."/>

  <xsl:key name="kByCount" match="m" use="@count"/>

  <xsl:template match="/">
    <xsl:variable name="vPass1">
      <xsl:for-each
        select="/*/*/MESSAGE[generate-id()
                            =
                             generate-id(key('kMsg',
                                             .
                                             )[1]
                                         )
                             ]">
         <xsl:sort select="count(key('kMsg',.))"
                   data-type="number"/>
         <m count="{count(key('kMsg',.))}"
            text="{.}"/>
      </xsl:for-each>
    </xsl:variable>

    <xsl:for-each
    select="ext:node-set($vPass1)/m
                   [generate-id()
                   =
                    generate-id(key('kByCount',
                                     @count
                                    )[1]
                                )
                   ]">
      <!-- order the output rows by group size -->
      <xsl:sort select="@count"
           data-type="number"/>

      <xsl:variable name="vAllText">
        <xsl:for-each select="key('kByCount', @count)">
          <xsl:value-of select="concat(' ', @text, ' ')"/>
        </xsl:for-each>
      </xsl:variable>

      <xsl:variable name="vrtfWords">
        <xsl:call-template name="str-split-to-words">
          <xsl:with-param name="pStr" select="$vAllText"/>
          <xsl:with-param name="pDelimiters" select="' '"/>
        </xsl:call-template>
      </xsl:variable>

      <xsl:variable name="vAvWords"
       select="(count(ext:node-set($vrtfWords)/word) - 1)
             div
               count(key('kByCount', @count))"/>

      <!-- group size, number of messages, average number of words -->
      <xsl:value-of select="concat(@count,
                                   ' ',
                                   count(key('kByCount',
                                              @count
                                             )
                                         ),
                                   ' ',
                                   $vAvWords,
                                   '&#xA;'
                                   )"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>


when applied to your source.xml:

<LOG>
  <SENT>
    <USER> 12345 </USER>
    <LOCATION> 55555 </LOCATION>
    <TARGET> 1 </TARGET>
    <TARGET_LOCATION> 23222 </TARGET_LOCATION>
    <MESSAGE> hello Fred </MESSAGE>
  </SENT>
  <SENT>
    <USER> 77777 </USER>
    <LOCATION> 76666 </LOCATION>
    <TARGET> 3 </TARGET>
    <TARGET_LOCATION> 34444 </TARGET_LOCATION>
    <MESSAGE> nice weather </MESSAGE>
  </SENT>
  <SENT>
    <USER> 77777 </USER>
    <LOCATION> 76666 </LOCATION>
    <TARGET> 4 </TARGET>
    <TARGET_LOCATION> 67777 </TARGET_LOCATION>
    <MESSAGE> nice weather </MESSAGE>
  </SENT>
  <SENT>
    <USER> 33333 </USER>
    <LOCATION> 12666 </LOCATION>
    <TARGET> 8 </TARGET>
    <TARGET_LOCATION> 98765 </TARGET_LOCATION>
    <MESSAGE> whats the latest news? </MESSAGE>
  </SENT>
  <SENT>
    <USER> 33333 </USER>
    <LOCATION> 12666 </LOCATION>
    <TARGET> 9 </TARGET>
    <TARGET_LOCATION> 46578 </TARGET_LOCATION>
    <MESSAGE> whats the latest news? </MESSAGE>
  </SENT>
</LOG>


produces the wanted result:

1 1 2
2 2 3
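
For readers new to the Muenchian method, the first pass can be tried on its
own. The following is a minimal sketch, not part of the stylesheet above: it
keeps only the first MESSAGE of each distinct text and prints how many times
that text occurs. It reuses the key name kMsg from the solution and assumes
the same LOG/SENT/MESSAGE layout as source.xml.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Index every MESSAGE by its string value -->
  <xsl:key name="kMsg" match="MESSAGE" use="."/>

  <xsl:template match="/">
    <!-- A MESSAGE passes the filter only if it is the first node
         (in document order) that the key returns for its own text -->
    <xsl:for-each
      select="/*/*/MESSAGE[generate-id()
                          = generate-id(key('kMsg', .)[1])]">
      <!-- number of occurrences of this text, then the text itself -->
      <xsl:value-of select="concat(count(key('kMsg', .)),
                                   ' ',
                                   normalize-space(.),
                                   '&#xA;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

On the source.xml above this should print one line per distinct message:

1 hello Fred
2 nice weather
2 whats the latest news?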

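If FXSL is not available, the words can also be counted with core XPath 1.0
functions only. This is a swapped-in technique, not the str-split-to-words
approach used above: after normalize-space() the words are separated by
exactly one space, so the number of words is the number of spaces plus one.
The sketch below assumes that whitespace is the only delimiter; the variable
name vNorm is made up for the illustration.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:for-each select="/*/*/MESSAGE">
      <!-- Trim and collapse whitespace: words are now separated
           by exactly one space -->
      <xsl:variable name="vNorm" select="normalize-space(.)"/>
      <xsl:choose>
        <!-- An empty message has no words -->
        <xsl:when test="not($vNorm)">0</xsl:when>
        <xsl:otherwise>
          <!-- words = spaces + 1 -->
          <xsl:value-of
            select="string-length($vNorm)
                    - string-length(translate($vNorm, ' ', ''))
                    + 1"/>
        </xsl:otherwise>
      </xsl:choose>
      <xsl:text>&#xA;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

This prints one word count per MESSAGE element. The same expression could be
stored in an attribute of the intermediate m elements and summed per group,
which would avoid the tokeniser at the cost of handling only whitespace
delimiters.
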

Hope this helped.


=====
Cheers,

Dimitre Novatchev.
http://fxsl.sourceforge.net/ -- the home of FXSL


"marina" <marina777uk(_at_)yahoo(_dot_)com> wrote in message
news:20030719075801(_dot_)60127(_dot_)qmail(_at_)web40609(_dot_)mail(_dot_)yahoo(_dot_)com(_dot_)(_dot_)(_dot_)
Hi,

I have an XML document that contains messages sent by
people to one another. Many of these messages in the
<MESSAGE> tags are repeated as they are sent by one
person to many others.

XML Snippet:
--------------------------------------------------
<LOG>
   <SENT>
      <USER> 12345 </USER>
      <LOCATION> 55555 </LOCATION>
      <TARGET> 1 </TARGET>
      <TARGET_LOCATION> 23222 </TARGET_LOCATION>
      <MESSAGE> hello Fred </MESSAGE>
   </SENT>
   <SENT>
      <USER> 77777 </USER>
      <LOCATION> 76666 </LOCATION>
      <TARGET> 3 </TARGET>
      <TARGET_LOCATION> 34444 </TARGET_LOCATION>
      <MESSAGE> nice weather </MESSAGE>
   </SENT>
   <SENT>
      <USER> 77777 </USER>
      <LOCATION> 76666 </LOCATION>
      <TARGET> 4 </TARGET>
      <TARGET_LOCATION> 67777 </TARGET_LOCATION>
      <MESSAGE> nice weather </MESSAGE>
   </SENT>
   <SENT>
      <USER> 33333 </USER>
      <LOCATION> 12666 </LOCATION>
      <TARGET> 8 </TARGET>
      <TARGET_LOCATION> 98765 </TARGET_LOCATION>
      <MESSAGE> whats the latest news? </MESSAGE>
   </SENT>
   <SENT>
      <USER> 33333 </USER>
      <LOCATION> 12666 </LOCATION>
      <TARGET> 9 </TARGET>
      <TARGET_LOCATION> 46578 </TARGET_LOCATION>
      <MESSAGE> whats the latest news? </MESSAGE>
   </SENT>
</LOG>
--------------------------------------------------
What I need to do is:-

1) Find out how many messages overall were sent to 1,
2, 3 etc. people.

As a duplicated message will always follow the
original, i.e. be the next <MESSAGE> tag of the
following sibling node, I'm thinking that the
stylesheet would start with the first message and keep
comparing siblings until it found one that was
different. Then it would just add the previous number
of sibling nodes? ( I probably need to use keys?)

2) For each of the total messages per group size,
calculate the average number of words. No idea on this
one I'm afraid!

So the desired output from the snippet above would be:-

Group Size   Number of Messages   Av Number Words
    1                1                   2
    2                2                   3
 (up to say 20)

Many thanks in advance for any help,

Marina





 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list





