xsl-list

Re: [xsl] Why is the variable and regex slow in saxon and fast in regex Buddy?

2010-09-29 08:29:20
Thanks Wolfgang,

I tried that regex in RegexBuddy and it works well, taking fewer steps
in both matching and non-matching cases.

^[A-Z][A-Za-z]+(\s+[A-Z][A-Za-z]*)*$

Thanks
Alex

On Wed, Sep 29, 2010 at 10:10 AM, Wolfgang Laun 
<wolfgang(_dot_)laun(_at_)gmail(_dot_)com> wrote:
On 28 September 2010 22:43, Alex Muir 
<alex(_dot_)g(_dot_)muir(_at_)gmail(_dot_)com> wrote:

Well, it turns out the problem was a combination of factors, but the
main culprit was the following regex, which, depending on the input
produced by the other 4 variables, would fail to terminate, or run fast
or slow. I suppose what confused me most was that it ran quickly for
most of the files I process, and removing one variable or another led
to improvements purely by chance, given the input files.

and matches($titleStopWordsRemoved,'^([A-Z][A-Za-z]{0,}\s*?)+$')

This is really evil, as it will backtrack exponentially. The "\s*"
isn't providing separation between words; a "\s+" should do the trick.
"{0,}" isn't wrong, but why not use "*"?
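
The exponential backtracking described here can be reproduced outside
Saxon. What follows is only a minimal sketch in Python, whose re module
is also a backtracking engine; it is not a measurement of Saxon itself.
The test string (a run of capitals ending in a digit) and the helper
names timed_match and compare are made up for illustration.

```python
import re
import time

# The pattern from the stylesheet, and the rewrite suggested in this thread.
SLOW = re.compile(r'^([A-Z][A-Za-z]{0,}\s*?)+$')
FAST = re.compile(r'^[A-Z][A-Za-z]*(\s+[A-Z][A-Za-z]*)*$')

def timed_match(pattern, text):
    """Return (matched, seconds) for one match attempt."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start

def compare(n):
    """Try both patterns on n capitals followed by a digit (cannot match)."""
    # Every capital can either continue the current word or start a new
    # group iteration, so the slow pattern has about 2**(n-1) ways to fail.
    text = 'A' * n + '1'
    return timed_match(SLOW, text), timed_match(FAST, text)

if __name__ == '__main__':
    for n in (10, 14, 18):
        (slow_ok, slow_t), (fast_ok, fast_t) = compare(n)
        print(n, slow_ok, f'{slow_t:.4f}s', fast_ok, f'{fast_t:.6f}s')
```

Both patterns reject the input; only the time taken by the first one
should grow sharply as n increases.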



I wrote this instead

  and (matches($titleStopWordsRemoved,'^([A-Z][A-Za-z]+\s+)+?([A-Z][A-Za-z]+?)$')
       or matches($titleStopWordsRemoved,'^[A-Z][A-Za-z]+\s*$'))">


I don't see the point of using two expressions, or "+?".

To match a string consisting entirely of capitalized words separated by
white space:

   ^[A-Z][A-Za-z]*(\s+[A-Z][A-Za-z]*)*$

You may add \s* at the end to handle optional trailing white space.
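
As a quick check of what this single expression accepts, here is a small
sketch in Python (an assumption of convenience; the thread itself uses
XPath matches()). The sample strings and the function name
is_capitalized_title are invented for the example.

```python
import re

# The suggested pattern, with the optional trailing \s* added at the end.
TITLE_WORDS = re.compile(r'^[A-Z][A-Za-z]*(\s+[A-Z][A-Za-z]*)*\s*$')

def is_capitalized_title(text):
    """True if text is only capitalized words separated by white space."""
    return TITLE_WORDS.match(text) is not None

if __name__ == '__main__':
    samples = ['Gone With Wind', 'A Tale', 'Title ', 'lower Case', 'Has 2 Digits']
    for sample in samples:
        print(repr(sample), is_capitalized_title(sample))
```

One expression covers both the multi-word and the single-word case, and
because "[A-Za-z]*" may be empty, a one-letter word such as "A" is
accepted too.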

-W

The first one looks for title or upper case words and the second just
one word.


I see that now; the profiler output makes it clear, given that

99.83 % - 14026 ms - 99.67 % - 1 inv. function-call (name="matches")

takes so long and the calls below it take so little time.


99.92 % - 3 ms - 0.03 % - 1 inv. xsl:template (match="chunk")
  99.89 % - 0 ms - 0.0 % - 1 inv. let (name="title")
    99.89 % - 0 ms - 0.0 % - 1 inv. let (name="titleBraketedTextRemoved")
      99.89 % - 2 ms - 0.02 % - 1 inv. let (name="titleNumberRemoved")
        99.86 % - 0 ms - 0.0 % - 1 inv. let (name="titleStripPunctuation")
          99.86 % - 0 ms - 0.0 % - 1 inv. let (name="titleStopWordsRemoved")
            99.86 % - 0 ms - 0.0 % - 1 inv. xsl:choose
              99.83 % - 14026 ms - 99.67 % - 1 inv. function-call (name="matches")
                0.16 % - 0 ms - 0.0 % - 1 inv. function-call (name="normalize-space")
                  0.16 % - 0 ms - 0.0 % - 1 inv. function-call (name="mh:removeStopwords")
                    0.15 % - 0 ms - 0.0 % - 1 inv. xsl:function (name="mh:removeStopwords") (as="xs:string?")
                    0.0 % - 0 ms - 0.0 % - 1 inv. function-call (name="mh:stripPunctuation")
              0.02 % - 0 ms - 0.01 % - 1 inv. noMatch
              0.01 % - 0 ms - 0.0 % - 1 inv. function-call (name="not")
        0.02 % - 0 ms - 0.0 % - 1 inv. function-call (name="replace")
0.03 % - 3 ms - 0.03 % - 1 inv. xsl:variable (name="stopwords")
  (select=" ('a', 'an', 'and', 'is', 'as', 'at', 'be', 'been', 'before',
  'between', 'both', 'but', 'by', 'for', 'from', 'in', 'into', 'of',
  'on', 'or', 'other', 'per', 'such ', 'than', 'that', 'the', 'these',
  'this', 'to' , 'Ñ')")

Thanks Much
Alex


On Tue, Sep 28, 2010 at 4:43 PM, Wolfgang Laun 
<wolfgang(_dot_)laun(_at_)gmail(_dot_)com> wrote:
Two comments, which may not shed any light on the non-termination, but
anyway:

First, the pattern "\([^\)]*\)" is supposed to remove any parenthesized
text, but there's no point in using "[^\)]", since the set of "any
character except ')'" is simply denoted by "[^)]", because a
parenthesis is not a meta-character within brackets.

Second, to remove all characters of a kind (single character or class)
it's better form to use a repetition, e.g., "\d+" rather than just "\d".
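
Both points can be checked with a short sketch. It is written in Python
rather than XPath replace(), on the assumption that the two regex
dialects agree on these simple patterns; the sample title and the
function names are invented for the example.

```python
import re

def strip_parenthesized(text):
    # "[^)]" needs no backslash: ")" is a literal inside a character class.
    return re.sub(r'\([^)]*\)', '', text)

def strip_digits(text):
    # "\d+" removes each run of digits with one replacement, not one per digit.
    return re.sub(r'\d+', '', text)

if __name__ == '__main__':
    title = 'Annual Report (draft 2) 2010 Edition 42'
    no_parens = strip_parenthesized(title)
    # The escaped form from the stylesheet gives the same result.
    print(no_parens == re.sub(r'\([^\)]*\)', '', title))
    # So does removing digits one at a time; "\d+" just does fewer replacements.
    print(strip_digits(no_parens) == re.sub(r'\d', '', no_parens))
```

Neither change alters the result; they only make the patterns easier to
read and reduce the number of replacements performed.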

-W


On 28 September 2010 14:44, Alex Muir 
<alex(_dot_)g(_dot_)muir(_at_)gmail(_dot_)com> wrote:
Hi,

I found something quite interesting which may help to understand the
issue further.

Taken independently, none of the following variables takes long to
process: when I no longer chain the variables together but run the
template calling only one variable, with the others commented out, the
run time is short.

  <xsl:variable name="title"
      select="mh:stripTextNewline(normalize-space(.))"/>

    <xsl:variable name="titleBraketedTextRemoved"
      select="replace($title,'\([^\)]*\)','')"/>

    <xsl:variable name="titleNumberRemoved"
      select="replace($titleBraketedTextRemoved,'\d','')"/>

    <xsl:variable name="titleStripPunctuation"
      select="mh:stripPunctuation($titleNumberRemoved)"/>

    <xsl:variable name="titleStopWordsRemoved"
      select="normalize-space(mh:removeStopwords($titleStripPunctuation,$stopwords))"/>

As the variables are combined they take more and more time to execute,
and finally, with all of them together, the transformation does not
stop running.
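
For readers who want to experiment with the chain outside XSLT, here is a
rough Python sketch of the same sequence of steps. The mh: functions are
not shown in this thread, so the punctuation and stopword steps below are
guessed stand-ins (in particular, exact lower-case stopword matching is
an assumption), and the name clean_title is hypothetical.

```python
import re

# Stopword list as posted in the thread (minus the non-ASCII entry).
STOPWORDS = {'a', 'an', 'and', 'is', 'as', 'at', 'be', 'been', 'before',
             'between', 'both', 'but', 'by', 'for', 'from', 'in', 'into',
             'of', 'on', 'or', 'other', 'per', 'such', 'than', 'that',
             'the', 'these', 'this', 'to'}

def normalize_space(text):
    """Rough equivalent of XPath normalize-space()."""
    return ' '.join(text.split())

def clean_title(raw):
    title = normalize_space(raw)
    no_brackets = re.sub(r'\([^)]*\)', '', title)    # titleBraketedTextRemoved
    no_digits = re.sub(r'\d+', '', no_brackets)      # titleNumberRemoved
    # Stand-in for mh:stripPunctuation (assumed behaviour).
    no_punct = re.sub(r'[^\w\s]', '', no_digits)
    # Stand-in for mh:removeStopwords (assumed: exact, case-sensitive match).
    kept = [word for word in no_punct.split() if word not in STOPWORDS]
    return ' '.join(kept)                            # titleStopWordsRemoved

if __name__ == '__main__':
    print(clean_title('The  History of XSLT (2nd ed.), 2010'))
```

Each step is cheap on its own; what matters for performance is the
string this chain finally hands to the matches() test.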

So initially I was wrong to suggest that the titleBraketedTextRemoved
variable was causing the problem. It's just that the problem is
exacerbated when I finally add that variable into the chain of
variables.

I reduced the size of the input file so that $title contains one short
line of text, in order to get some profiling data; however, the
processing still does not complete.

I'll have to talk to my client later today before posting the full code.

Thanks
Alex


On Mon, Sep 27, 2010 at 7:54 PM, Michael Kay 
<mike(_at_)saxonica(_dot_)com> wrote:
I don't know - they are both, I think, using the Java regular expression
engine underneath. It may be a function of how you are measuring it. It
could be that the cost is dominated not by the cost of evaluating the
regex, but by the cost of checking that it conforms to the XPath rules.
Did you run a Java profile to determine where the time is being spent?

Michael Kay
Saxonica

On 27/09/2010 7:21 PM, Alex Muir wrote:

Hi,

I'm unable to figure out why this regex is so time consuming that it
never finishes in oXygen, yet runs quickly in RegexBuddy on the same
content.

    <xsl:variable name="BraketedTextRemoved"
       select="replace($title,'\([^\)]*\)','')"/>

I'm just trying to remove content with brackets ( dfd234**#*$#*$#fdfd )

Running on vendor="SAXON 9.2.0.6 from Saxonica" version="2.0"

Any Ideas?

Thanks
Alex

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or 
e-mail:<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--



