xsl-list

Re: how to extract words from a text

2004-12-10 13:51:32
I decided to take a whack at it and came up with the following XSL file:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
  version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output
  method="text"
  omit-xml-declaration="yes"
  indent="no"
/>

  <xsl:template match="text">
    <xsl:call-template name="makeList">
      <xsl:with-param name="textIn" select="translate(., ',', '')"/>
      <xsl:with-param name="wordsSoFar"/>
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="makeList">
    <xsl:param name="textIn"/>
    <xsl:param name="wordsSoFar"/>
    <xsl:choose>
      <xsl:when test="contains($textIn, ' ')">
        <xsl:variable name="firstWord" select="substring-before($textIn, ' ')"/>
        <xsl:choose>
          <xsl:when test="string-length($firstWord)>2 and not(contains($wordsSoFar, $firstWord))">
            <xsl:variable name="newString">
              <xsl:choose>
                <xsl:when test="string-length($wordsSoFar)=0">
                  <xsl:value-of select="$firstWord"/>
                </xsl:when>
                <xsl:otherwise>
                  <xsl:value-of select="$firstWord"/><xsl:text>, </xsl:text><xsl:value-of select="$wordsSoFar"/>
                </xsl:otherwise>
              </xsl:choose>
            </xsl:variable>
            <xsl:call-template name="makeList">
              <xsl:with-param name="textIn"
                select="substring-after($textIn, ' ')"/>
              <xsl:with-param name="wordsSoFar" select="$newString"/>
            </xsl:call-template>
          </xsl:when>
          <xsl:otherwise>
            <xsl:call-template name="makeList">
              <xsl:with-param name="textIn"
                select="substring-after($textIn, ' ')"/>
              <xsl:with-param name="wordsSoFar" select="$wordsSoFar"/>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:when>
      <xsl:otherwise>
        <xsl:choose>
          <xsl:when test="string-length($textIn)>2">
            <xsl:choose>
              <xsl:when test="contains($wordsSoFar, $textIn)">
                <xsl:value-of select="$wordsSoFar"/>
              </xsl:when>
              <xsl:otherwise>
                <xsl:value-of select="$textIn"/><xsl:text>, </xsl:text><xsl:value-of select="$wordsSoFar"/>
              </xsl:otherwise>
            </xsl:choose>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="$wordsSoFar"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>

When run against the following XML file:

<root>
  <text>This is a text, that is a text</text>
</root>

it produces the following output:

that, text, This

Note that it does not handle case, so 'Text' and 'text' are different 
words. I only have so much time to fiddle, so I didn't get that far. Also, 
I expect that other, more-experienced, folks around here can produce a 
better implementation. Still, this one works.
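
For the case issue, XSLT 1.0 has no lower-case() function, but translate() can fold ASCII letters. Here is a minimal sketch, not a tested fix for the stylesheet above. The variable names "upper" and "lower" are my own, and it covers only A-Z, so accented letters pass through unchanged.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical helper variables; ASCII letters only. -->
  <xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>

  <xsl:template match="text">
    <!-- Inner translate() drops commas; outer translate() folds case. -->
    <xsl:value-of
      select="translate(translate(., ',', ''), $upper, $lower)"/>
  </xsl:template>

</xsl:stylesheet>
```

Against the sample input, the text element comes out as "this is a text that is a text". Passing the same expression as the textIn parameter of makeList should make 'Text' and 'text' count as one word.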

Jay Bryant
Bryant Communication Services




From: JBryant(_at_)s-s-t(_dot_)com
Date: 12/10/2004 01:32 PM
Reply-To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] how to extract words from a text

> And look at substring-after() or substring-before() and a recursive
> template...

Bingo. If I were going to try this, I would write a recursive template
that nibbled the first word off the string, checked its length, kept it if
3+ characters or tossed it if too short, and then passed the remaining
string to the next instance of the template. Once no spaces remain in the
string, it's done.
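
The nibbling idea can be sketched on its own, without the duplicate check, roughly as follows. This is only an illustration under assumptions: the template name "keepLongWords" and the parameter name "rest" are made up, the input is the root/text sample from the original question, and words are written one per line rather than as a comma-separated list.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>
  <!-- Drop whitespace-only text nodes so only the words are written. -->
  <xsl:strip-space elements="*"/>

  <xsl:template match="text">
    <xsl:call-template name="keepLongWords">
      <!-- Remove commas and collapse runs of whitespace to single spaces. -->
      <xsl:with-param name="rest"
        select="normalize-space(translate(., ',', ''))"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Hypothetical template: nibble one word, keep it if 3+ characters. -->
  <xsl:template name="keepLongWords">
    <xsl:param name="rest"/>
    <!-- The appended space lets the last word be cut like any other. -->
    <xsl:variable name="word"
      select="substring-before(concat($rest, ' '), ' ')"/>
    <xsl:if test="string-length($word) &gt; 2">
      <xsl:value-of select="$word"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:if>
    <!-- Recurse on the remainder; stop once no spaces remain. -->
    <xsl:if test="contains($rest, ' ')">
      <xsl:call-template name="keepLongWords">
        <xsl:with-param name="rest" select="substring-after($rest, ' ')"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

For "This is a text, that is a text" this should write This, text, that and text, each on its own line. Removing the repeated word still needs a record of the words already seen.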

Jay Bryant
Bryant Communication Services




From: António Mota <xptm(_at_)sapo(_dot_)pt>
Date: 12/10/2004 01:05 PM
Reply-To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] how to extract words from a text

I have no idea either, especially on a Friday at this hour...

But maybe this gives _you_ something to think about. It's a "word count"
method.

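<xsl:variable name="txt"><xsl:value-of select="text" /></xsl:variable>
<!-- Collapse runs of whitespace so words are separated by single spaces. -->
<xsl:variable name="x" select="normalize-space($txt)" />
<!-- Strip the spaces from the normalized string, not from the raw text. -->
<xsl:variable name="y" select="translate($x, ' ', '')" />
<!-- Number of single spaces plus one. -->
<xsl:variable name="wc" select="string-length($x) - string-length($y) + 1" />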

so wc (word count) in your example will be 8...
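
To try the count on its own, the variables can be wrapped in a small stylesheet. This is a sketch under assumptions: it matches the root/text structure from the original question, it strips spaces from the normalized string, and it adds a guard for empty text, which the formula alone would count as one word.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/root">
    <!-- x: text with whitespace collapsed to single spaces. -->
    <xsl:variable name="x" select="normalize-space(text)"/>
    <!-- y: the same string with every space removed. -->
    <xsl:variable name="y" select="translate($x, ' ', '')"/>
    <xsl:choose>
      <!-- Guard (an addition): empty text has no words. -->
      <xsl:when test="$x = ''">0</xsl:when>
      <xsl:otherwise>
        <!-- Number of separating spaces plus one. -->
        <xsl:value-of
          select="string-length($x) - string-length($y) + 1"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

For the sample text the normalized string is 30 characters long and 23 without spaces, so the result is 30 - 23 + 1 = 8.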

And look at substring-after() or substring-before() and a recursive 
template...


Quoting Jan Limpens <jan(_dot_)limpens(_at_)gmail(_dot_)com>:

hello again,

I hope you can help me with this one just as well, as with my other
question today! :)

I have an XML document
<root>
<text>This is a text, that is a text</text>
</root>

and I need to extract every word from it - once, ignoring case, and
ordered by occurrence, stripping 1-2 letter words - to make a meta
keywords tag from it...

<meta name="keywords" content="text, that, this"/>

the horror! the horror! I have no idea how to do this! :)

thanks again!
--
Jan
http://www.limpens.com

Otakoo Saloon Cartoon - newest episode at http://limpens.com/oscredirect

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--






