RE: Re: text() word lists

On Mon, 9 Feb 2004 David(_dot_)Pawson(_at_)rnib(_dot_)org(_dot_)uk wrote:

I said:
      Is it possible to remove all numbers too?
    Or are they part of the lexicographer's toolset?


They can be (I'm reliably informed by a linguist sitting
a few desks away), in that someone might be analysing the
text of (say) a motoring magazine: "The A1-M1 link road"
(for UK readers) or "a V6 engine ... or I could have had a V8",
where any comparisons don't make sense without the numbers.

So what is the best way to parameterise these to allow
turning on/off the removal of numbers?  And while
we're at it, turning on/off the removal of hyphens or
other possibly-word-forming characters?
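
One possibility, offered only as an untested sketch, would be
to lift both choices into stylesheet parameters. The parameter
names remove-numbers and separators below are my own invention,
not anything from the earlier messages, and I am assuming that
"removing numbers" means dropping tokens made up entirely of
digits, so that "V6" and "A1-M1" survive either way.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical switch: 'yes' drops purely numeric tokens,
       any other value keeps them. -->
  <xsl:param name="remove-numbers" select="'yes'"/>

  <!-- Hypothetical regex of separator characters. Adding \- to the
       class, i.e. '[\s.?!,)(\-]+', would also split on hyphens. -->
  <xsl:param name="separators" select="'[\s.?!,)(]+'"/>

  <xsl:template match="/">
    <frequencies>
      <!-- Tokenise, drop empty tokens, lower-case, then optionally
           filter out tokens consisting only of digits. -->
      <xsl:for-each-group group-by="."
          select="for $w in tokenize(string(.), $separators)[.]
                  return lower-case($w)
                    [not($remove-numbers = 'yes' and matches(., '^[0-9]+$'))]">
        <xsl:sort select="count(current-group())" order="descending"/>
        <word>
          <xsl:value-of select="current-grouping-key(), '  -  ',
                                count(current-group())"/>
        </word>
      </xsl:for-each-group>
    </frequencies>
  </xsl:template>

</xsl:stylesheet>
```

Filtering in the select expression, rather than with
xsl:analyze-string, means each surviving word is written once.
How the parameters are supplied depends on the processor; with
Saxon I would expect something like remove-numbers=no on the
command line to do it.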

<xsl:template match="/">
  <frequencies>
    <xsl:for-each-group group-by="." select="
        for $w in tokenize(string(.), '[\s.?!,)(]+')[.] return lower-case($w)">
      <xsl:sort select="count(current-group())" order="descending"/>
      <xsl:analyze-string select="current-grouping-key()" regex="[0-9]+">
        <xsl:non-matching-substring>
          <word><xsl:value-of select="current-grouping-key(), '  -  ',
                                      count(current-group())"/></word>
        </xsl:non-matching-substring>
        <xsl:matching-substring/>
      </xsl:analyze-string>
    </xsl:for-each-group>
  </frequencies>
</xsl:template>

Seems to work nicely.
  Thanks Michael, very useful.

regards DaveP


---
Dr James Cummings, Oxford Text Archive, University of Oxford
James.Cummings at ota.ahds.ac.uk http://users.ox.ac.uk/~jamesc/

