xsl-list

Spelling Othello (Was: Re: [xsl] Text processing on XSLT 2.0)

2005-04-04 14:00:07
I didn't mention that the text I was spell-checking was the play:

  "Othello"

by William Shakespeare

On Apr 5, 2005 6:56 AM, Dimitre Novatchev <dnovatchev(_at_)gmail(_dot_)com> wrote:
On Apr 5, 2005 6:41 AM, M. David Peterson <m(_dot_)david(_dot_)x2x2x(_at_)gmail(_dot_)com> wrote:
Working on projects such as XBiblio/Citeproc, led by Bruce D'Arcus,
I have realized that, however far the XSLT 2.0 working draft goes
toward bringing Perl-esque text processing to the XML developer,
it is still up to the developer to fine-tune these capabilities
to cover their specific needs.  A spell checker is one example.

Can anyone with extensive experience developing such capabilities
in XSLT share that experience with the rest of us?

Hi Mark,

Lately I have been having fun with an f:binSearch() function and then,
logically, with f:spell().

I have a dictionary of about 47000 English wordforms, which I
search with f:binSearch().
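
For readers without the FXSL sources at hand, the idea of a binary search over a sorted word list can be sketched roughly as below. This is only an illustration and not the code of f:binSearch(): the my: namespace, the my:binSearch name, its four parameters and the four-word dictionary are all assumptions made for the example.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:my="http://example.org/sketch"
  exclude-result-prefixes="my xs">

  <xsl:output method="text"/>

  <!-- Codepoint collation, so that sorting and comparing agree on the order -->
  <xsl:variable name="vCodepoint" as="xs:string"
    select="'http://www.w3.org/2005/xpath-functions/collation/codepoint'"/>

  <!-- Hypothetical helper (not the FXSL source): true() if $pWord occurs
       in positions $pLow to $pHigh of $pSorted, which must be sorted
       in codepoint order -->
  <xsl:function name="my:binSearch" as="xs:boolean">
    <xsl:param name="pSorted" as="xs:string*"/>
    <xsl:param name="pWord" as="xs:string"/>
    <xsl:param name="pLow" as="xs:integer"/>
    <xsl:param name="pHigh" as="xs:integer"/>
    <xsl:choose>
      <!-- Empty range: the word is not in the list -->
      <xsl:when test="$pLow gt $pHigh">
        <xsl:sequence select="false()"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="vMid" as="xs:integer"
          select="($pLow + $pHigh) idiv 2"/>
        <!-- -1, 0 or 1: $pWord relative to the middle entry -->
        <xsl:variable name="vCmp" as="xs:integer"
          select="compare($pWord, $pSorted[$vMid], $vCodepoint)"/>
        <!-- Found, or continue in the lower or upper half -->
        <xsl:sequence select="
          if ($vCmp eq 0) then true()
          else if ($vCmp lt 0)
            then my:binSearch($pSorted, $pWord, $pLow, $vMid - 1)
            else my:binSearch($pSorted, $pWord, $vMid + 1, $pHigh)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <!-- Made-up dictionary for the example, sorted once before any lookup -->
  <xsl:variable name="vDict" as="xs:string+">
    <xsl:perform-sort select="('othello', 'iago', 'cassio', 'desdemona')">
      <xsl:sort select="." collation="{$vCodepoint}"/>
    </xsl:perform-sort>
  </xsl:variable>

  <!-- Run against any input document; looks up a single word -->
  <xsl:template match="/">
    <xsl:value-of select="my:binSearch($vDict, 'iago', 1, count($vDict))"/>
  </xsl:template>
</xsl:stylesheet>
```

Each lookup halves the remaining range, so a dictionary of about 47000 entries needs roughly 16 comparisons per word. The recursion depth is bounded by the same number, so stack usage is not a concern here.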

I also had to write a function faster than the current
str-split-to-words template, which runs in quadratic time: this is
the f:getWords() function.
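
As a rough point of comparison, word splitting can also be approximated with the built-in tokenize() function of XPath 2.0. The sketch below is not how f:getWords() is implemented; the my:words name, the my: namespace and the exact delimiter set in the regular expression are assumptions chosen for illustration, and how fast tokenize() runs depends on the processor.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:my="http://example.org/sketch"
  exclude-result-prefixes="my xs">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: lower-cases $pText and splits it on runs of
       whitespace and the listed punctuation; the empty tokens produced
       by leading or trailing delimiters are dropped -->
  <xsl:function name="my:words" as="xs:string*">
    <xsl:param name="pText" as="xs:string"/>
    <xsl:sequence select="
      tokenize(lower-case($pText), '[\s,:.\-!?;'']+')[. ne '']"/>
  </xsl:function>

  <!-- Lists the distinct words of the input document, one per line -->
  <xsl:template match="/">
    <xsl:variable name="vWords" as="xs:string*"
      select="for $t in //text() return my:words($t)"/>
    <xsl:variable name="vUnique" as="xs:string*">
      <xsl:perform-sort select="distinct-values($vWords)">
        <xsl:sort select="."/>
      </xsl:perform-sort>
    </xsl:variable>
    <xsl:value-of select="$vUnique" separator="&#10;"/>
  </xsl:template>
</xsl:stylesheet>
```

Because the apostrophe is treated as a delimiter, contractions such as "o'er" are split into two tokens; whether that is desirable depends on the dictionary being used.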

All these functions can be downloaded from the FXSL CVS (just let me
know if you'd want me to send you the zip archive).

The combination of these functions works quite well.

This transformation (test-FuncSpell.xsl):

<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:f="http://fxsl.sf.net/"
exclude-result-prefixes="f xs">

 <xsl:import href="../f/func-getWords.xsl"/>
 <xsl:import href="../f/func-spell.xsl"/>

 <xsl:output omit-xml-declaration="yes"/>

<xsl:variable name="vDelim" as="xs:string">
,—:.-&#9;&#10;&#13;'!?;</xsl:variable>

<!-- To be applied on ../data/othello.xml -->
 <xsl:template match="/">
   <xsl:variable name="vwordNodes" as="element()*">
      <xsl:for-each select="//text()/lower-case(.)">
        <xsl:sequence select="f:getWords(., $vDelim, 1)"/>
      </xsl:for-each>
   </xsl:variable>

   <xsl:variable name="vUnique" as="xs:string+">
     <xsl:perform-sort select="distinct-values($vwordNodes)">
       <xsl:sort select="."/>
     </xsl:perform-sort>
   </xsl:variable>

   <xsl:variable name="vnotFound" as="xs:string*"
    select="$vUnique[not(f:spell(.))]"/>

   <xsl:value-of separator="&#xA;"
    select="$vnotFound"/>

   A total of <xsl:value-of select="count($vwordNodes)"/> words
   were spelt, (<xsl:value-of select="count($vUnique)"/>) distinct.

   <xsl:value-of select="count($vnotFound)"/> not found.
</xsl:template>
</xsl:stylesheet>

when applied on othello.xml (around 29000 words)

produces this result:

Saxon 8.3 from Saxonica
Java version 1.5.0_01
Stylesheet compilation time: 1140 milliseconds
Processing file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml
Building tree for
file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml using
class net.sf.saxon.tinytree.TinyBuilder
Tree built in 94 milliseconds
Tree size: 18539 nodes, 154557 characters, 0 attributes
Building tree for file:/C:/CVS-DDN/fxsl-xslt2/f/func-getWords.xsl
using class net.sf.saxon.tinytree.TinyBuilder
Tree built in 0 milliseconds
Tree size: 43 nodes, 143 characters, 22 attributes
Building tree for file:/C:/CVS-DDN/fxsl-xslt2/data/dictEnglish.xml
using class net.sf.saxon.tinytree.TinyBuilder
Tree built in 188 milliseconds
Tree size: 139140 nodes, 528397 characters, 0 attributes
Execution time: 7015 milliseconds

<a-list-of-567-unknown-words-omitted/>

   A total of 28622 words
   were spelt, (3669) distinct.

   567 not found.

So, checking 3669 distinct words in 7015 milliseconds gives

 523.02 words/sec.

The actual lookup speed is higher, since the total time also includes
splitting the text into words and finding the distinct ones.

Among the unknown words are such nice words as:

affordeth
affrighted
ariseth
arithmetician
arrivance
bethink
betimes
bewhored

:o)

Cheers,

Dimitre

