xsl-list
[Top] [All Lists]

Re: [xsl] Efficient way to check sequence membership

2011-03-02 15:35:15
One way is to have the strings sorted. Then use binary search -- this
is O(log(N)).

There is a standard FXSL function for this:  f:binSearch and here is the code:

 <xsl:function name="f:binSearch" as="xs:boolean">
   <xsl:param name="pSeq" as="item()+"/>
   <xsl:param name="pVal" as="item()"/>
   <xsl:param name="pStart" as="item()"/>
   <xsl:param name="pEnd" as="item()"/>

   <xsl:sequence select=
    "if($pStart gt $pEnd) then false()
       else
        for $mid in ($pStart + $pEnd) idiv 2,
             $sVal in $pSeq[$mid]
           return
             if($sVal eq $pVal) then true()
             else
                if($sVal lt $pVal)
                   then f:binSearch($pSeq, $pVal, $mid+1, $pEnd)
                else
                  f:binSearch($pSeq, $pVal, $pStart, $mid - 1)
       "/>
 </xsl:function>

I have used the f:binSearch() functions for years -- for example in an
efficient spelling checker in which the dictionary is a sequence of
sorted strings. Even 5 years ago it was working with a speed of 3000
words per second when checking Shakespeare's Othello.


Another way is using XPath 3.0 and the Binary Search tree datatype
that I recently implemented.

This is described (and with full code) here:

http://dnovatchev.wordpress.com/2011/02/12/part-2-the-binary-search-data-structure-with-xpath-3-0-deleting-a-node/



Hope this helped :)



On Wed, Mar 2, 2011 at 1:23 PM, Henry S. Thompson 
<ht(_at_)inf(_dot_)ed(_dot_)ac(_dot_)uk> wrote:
A common requirement for me is to check if a particular string is a
member of a set of other strings.  I often need this in an inner loop,
doing it hundreds if not thousands of times.

So, what's the most efficient way to do this?

Consider the example (a real one) of checking to see if a word is what
the IR people call a 'stop' word -- a short common word of little
substance.

Here's a function which checks this in a straightforward way:

<xsl:variable
name="stopPat">it|its|itself|they|them|their|what|which|who|whom|this|that|these|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|doing|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|against|between|into|through|during|before|after|above|below|to|from|up|down|in|out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|all|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|than|too|very|s|t|can|will|just|don|should|now</xsl:variable>

 <xsl:variable name="stops" select="tokenize($stopPat,'\|')"/>

 <xsl:function name="my:stop1" as="xs:boolean">
 <xsl:param name="w" as="xs:string"/>
 <xsl:sequence select="some $s in $stops satisfies ($s eq $w)"/>
 </xsl:function>

For obvious reasons this seems unlikely to be very efficient.

I've come up with 4 other ways to do this, and the best of them is
over 6 times faster than that one, using a realistic set of test
data.  The worst is more than twice as _slow_ as that one.  That's a
factor of 12 between best and worst!

Timing was done using saxon91 on a Windows box, using -repeat:100,
subtracting a baseline function which always returns true w/o checking
anything.  I called the stop function 289 times.  On 95 occasions the
argument was a stop word (drawn from only 23 word types).  There are
104 stop words in $stopPat.

I'll follow this message with another containing the details: skip it
if you don't care, or have a think about what _you_ think is the best
answer before reading.

ht
--
      Henry S. Thompson, School of Informatics, University of Edinburgh
     10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
               Fax: (44) 131 651-1426, e-mail: 
ht(_at_)inf(_dot_)ed(_dot_)ac(_dot_)uk
                      URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged 
spam]

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--





-- 
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
-------------------------------------
Never fight an inanimate object
-------------------------------------
You've achieved success in your field when you don't know whether what
you're doing is work or play
-------------------------------------
Facts do not cease to exist because they are ignored.

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--