xsl-list

Re: [xsl] finding and removing duplicate string

2011-12-02 11:22:15
Unless your <p> paragraphs are very short, you should not use pattern
matching like this, because a pattern with a back-reference exhibits
quadratic performance in the length of the string.

I ran a quick test comparing Java's regex engine to the substring
comparison proposed earlier in this thread.

The "hit" case (2 x "the quick brown..."):
   pattern:  0.000003061s - substr:  0.000000134s, a factor of 22

The "fail" case ("the quick brown..." vs "okkokoko...", equal lengths)
   pattern:  0.000004452s - substr:  0.000000026s, a factor of 171

Some XSLT regex engines might be faster, but their execution time is
still bound to grow as O(n^2).
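
For illustration, here is a minimal Java sketch of the two approaches.
It is not the code behind the timings above: the class and method names
are hypothetical, and the substring check is only an assumption about
what the earlier proposal amounts to. After whitespace normalization
and trimming (the trim is an addition that the quoted template does not
perform), a doubled text "X X" has odd length with a single space
exactly in the middle, so one region comparison is enough.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class DuplicateCheck {

    // Same pattern as in the template quoted below.
    private static final Pattern REPEATED = Pattern.compile("^(.+)\\s+\\1$");

    // Collapse every whitespace run to one space and trim the ends.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    // Regex approach: the engine tries candidate lengths for group 1
    // and re-compares the back-reference for each of them.
    static String dedupeRegex(String s) {
        String norm = normalize(s);
        Matcher m = REPEATED.matcher(norm);
        return m.matches() ? m.group(1) : norm;
    }

    // Substring approach: only one split point is possible, the middle.
    static String dedupeSubstring(String s) {
        String norm = normalize(s);
        int len = norm.length();
        if (len >= 3 && len % 2 == 1) {
            int mid = len / 2; // index of the separating space, if any
            if (norm.charAt(mid) == ' '
                    && norm.regionMatches(0, norm, mid + 1, mid)) {
                return norm.substring(0, mid);
            }
        }
        return norm; // not a doubled string: return it unchanged
    }
}
```

Both methods return the first half for a doubled input and the
normalized string otherwise. The substring variant does a single pass
over the string, which is why it does not degrade the way the
back-reference pattern does.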

-W


On 2 December 2011 17:29, Imsieke, Gerrit, le-tex
<gerrit(_dot_)imsieke(_at_)le-tex(_dot_)de> wrote:
 <xsl:template match="p">
   <xsl:copy>
     <xsl:copy-of select="@*" />
     <!-- Normalize the input first with replace(), i.e. collapse each
          run of whitespace (including newlines) into a single space: -->
     <xsl:analyze-string select="replace(., '\s+', ' ')"
                         regex="^(.+)\s+\1$">
       <!-- \1 is a back-reference to the first captured group, which is
            allowed according to
            http://www.w3.org/TR/xpath-functions/#regex-syntax -->
       <xsl:matching-substring>
         <xsl:value-of select="regex-group(1)"/>
       </xsl:matching-substring>
       <xsl:non-matching-substring>
         <!-- Output the whole string if the regex above doesn't match: -->
         <xsl:value-of select="."/>
       </xsl:non-matching-substring>
     </xsl:analyze-string>
   </xsl:copy>
 </xsl:template>


On 2011-12-02 16:32, Jacob L wrote:

All,


I am using <xsl:stylesheet version="2.0">. In the input XML file, the
text in the <p> tag sometimes repeats itself, such as:



<text>

<p>Bradley Cooper named People’s ‘Sexiest man alive 2011”  Bradley
Cooper named People’s ‘Sexiest man alive 2011”</p>

</text>



I want to write code that detects the duplicate and omits it. After
adding the check to the XSL and removing the duplicate string, the
output should be:



 <text>
        <p>Bradley Cooper named People’s ‘Sexiest man alive 2011”</p>
   </text>


Thanks for the help!

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or 
e-mail:<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit(_dot_)imsieke(_at_)le-tex(_dot_)de, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
Thomas Schmidt, Dr. Reinhard Vöckler
