
Re: [xsl] Finding first difference between 2 text strings

2009-09-14 14:01:44
What a clever/impressive/compact solution!

David's solution is the one I decided to use because it avoids potential
problems with stack overflow during recursion.  I don't understand all
the details of the function, but that's one advantage of reusable code!
With our data, the regexp processing didn't seem to be stressed too much
since I got reasonable results for strings up to 1000 characters in
length.

Our text strings also contain '(', ')', and '?', so they had to be added
to the list of special characters to be processed.

I suppose the use of the ')' in the function could be replaced by a
character not occurring in the text data.

Since we're also processing just ASCII text, and not Unicode, I replaced
the hex codes in the translation with just a space for each special
character.  The ordering of special characters doesn't matter (to me),
so a blank seemed to work fine.  The hex codes also seemed to throw-off
the resulting position of the mismatch, although I didn't investigate
thoroughly.

My changes to the function amount to the following (with similar changes
for $b):

<xsl:variable name="single-quote">'</xsl:variable>

<xsl:param name="a" as="xs:string" />
<xsl:variable name="aa-pattern" select="concat('.,+*\{}[]()?', $single-quote)" />
<!-- 13 spaces: one per character in $aa-pattern -->
<xsl:variable name="aa" select="translate($a, $aa-pattern, '             ')" />
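
One thing to watch with translate(): the replacement string needs one character per pattern character. Any pattern character without a counterpart is deleted rather than replaced, which shortens the string and shifts every later position. A minimal sketch of the difference (the sample string is made up, and the template name "main" is just a convenient entry point, e.g. for Saxon's -it:main):

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template name="main">
    <!-- Replacement shorter than the pattern: '.' becomes a space,
         but '(' and ')' are deleted, so 6 characters shrink to 4 -->
    <xsl:value-of select="string-length(translate('a.b(c)', '.()', ' '))"/>
    <xsl:text>&#10;</xsl:text>
    <!-- One space per pattern character: the length stays 6 -->
    <xsl:value-of select="string-length(translate('a.b(c)', '.()', '   '))"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

This prints 4 and then 6. Only the second form keeps mismatch positions in the translated string aligned with the original.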

For example, invoke the function as:
<xsl:variable name="pos1" select=" f:mismatch2 ($a, $b)" />

I also went ahead and reversed the strings so that I could find the last
character in the string difference, and then extract the whole section
that was different:

<xsl:variable name="rev-a"
select="codepoints-to-string(reverse(string-to-codepoints($a)))" />
<xsl:variable name="rev-b"
select="codepoints-to-string(reverse(string-to-codepoints($b)))" />
<xsl:variable name="pos2" select=" f:mismatch2 ($rev-a, $rev-b)" />

Then output this string:
substring($a, $pos1, string-length($a) - $pos2 - $pos1 + 2)

or this string, depending on which sub-section is desired for the user
(and, actually, I output both for a "from"/"to" comparison):
substring($b, $pos1, string-length($b) - $pos2 - $pos1 + 2)
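
To check the arithmetic, here is a small self-contained sketch. The sample strings, the f:rev helper, the namespace URI bound to f:, and the "main" entry template are all made up for illustration. f:mismatch2 is David's function as quoted later in this message.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:mismatch"
    exclude-result-prefixes="xs f">

  <xsl:output method="text"/>

  <!-- David's function; the URI bound to f: above is a placeholder -->
  <xsl:function name="f:mismatch2" as="xs:integer?">
    <xsl:param name="a" as="xs:string"/>
    <xsl:param name="b" as="xs:string"/>
    <xsl:variable name="aa"
      select="translate($a,'.+*\{}[]','&#xe001;&#xe002;&#xe003;&#xe004;&#xe005;&#xe006;&#xe007;&#xe008;')"/>
    <xsl:variable name="bb"
      select="translate($b,'.+*\{}[]','&#xe001;&#xe002;&#xe003;&#xe004;&#xe005;&#xe006;&#xe007;&#xe008;')"/>
    <xsl:variable name="r"
      select="concat('^:',replace($bb,'.','($0'),replace($bb,'.',')?'),'.*')"/>
    <xsl:sequence
      select="1+string-length(replace(concat(':',$aa),$r,'$1'))"/>
  </xsl:function>

  <!-- Hypothetical helper: reverse a string by code points -->
  <xsl:function name="f:rev" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <xsl:sequence
      select="codepoints-to-string(reverse(string-to-codepoints($s)))"/>
  </xsl:function>

  <xsl:template name="main">
    <xsl:variable name="a" select="'the quick brown fox'"/>
    <xsl:variable name="b" select="'the quick red fox'"/>
    <!-- Common prefix "the quick " is 10 characters, so pos1 = 11 -->
    <xsl:variable name="pos1" select="f:mismatch2($a, $b)"/>
    <!-- Common suffix " fox" is 4 characters, so pos2 = 5 -->
    <xsl:variable name="pos2" select="f:mismatch2(f:rev($a), f:rev($b))"/>
    <xsl:text>from: </xsl:text>
    <!-- 19 - 5 - 11 + 2 = 5 characters starting at position 11 -->
    <xsl:value-of
      select="substring($a, $pos1, string-length($a) - $pos2 - $pos1 + 2)"/>
    <xsl:text>&#10;to:   </xsl:text>
    <!-- 17 - 5 - 11 + 2 = 3 characters starting at position 11 -->
    <xsl:value-of
      select="substring($b, $pos1, string-length($b) - $pos2 - $pos1 + 2)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

With these inputs the "from" line shows "brown" and the "to" line shows "red". When the common prefix and common suffix overlap (say 'aaa' against 'aa'), the computed length drops to zero or below and substring() returns an empty string, so that case may need separate handling.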

Processing time was not excessive, and I got some useful output from our
data.

Thanks again to David and the others who supplied working solutions!

-- Mike Cook



An alternative definition that appears to give the same results is:

  <xsl:function name="f:mismatch2" as="xs:integer?">
    <xsl:param name="a" as="xs:string" />
    <xsl:param name="b" as="xs:string" />
    <xsl:variable name="aa"
      select="translate($a,'.+*\{}[]','&#xe001;&#xe002;&#xe003;&#xe004;&#xe005;&#xe006;&#xe007;&#xe008;')"/>
    <xsl:variable name="bb"
      select="translate($b,'.+*\{}[]','&#xe001;&#xe002;&#xe003;&#xe004;&#xe005;&#xe006;&#xe007;&#xe008;')"/>
    <xsl:variable name="r"
      select="concat('^:',replace($bb,'.','($0'),replace($bb,'.',')?'),'.*')"/>
    <xsl:sequence
      select="1+string-length(replace(concat(':',$aa),$r,'$1'))"/>
  </xsl:function>

If $b is long, this might stretch the capabilities of the regexp engine
though....

David

