xsl-list
[Top] [All Lists]

RE: How Do I Generate A Set-Difference With Context - Part A

2005-03-12 17:52:23
This continues my earlier post, unfortunately unresponded to, on the same subject. The original post is here: http://www.biglist.com/lists/xsl-list/archives/200503/msg00332.html

I've put on the back burner my initial quest of writing a generic diff-finding stylesheet. While I know I could do this in a, say, procedural programming language (walk the trees, compare children, recurse), and since [I read that] XSLT is Turing Complete, I know I can achieve this in XSLT as well. Just not right away.

The diff quest came from the following problem: I get periodic XML "feeds" from a news syndicate; these feeds are parsed, formatted in HTML, and published on a website. Each feed is an XML file, and contains zero or more "stories". A story may be exactly like that in the immediately-prior feed, may be slightly different, or may be completely new. Hence my desire to "diff" 2 feeds rather than simply regenerate all stories. When only, say, 20 stories change among 1000+ stories, this is a processing win.

I've modified my initial diff quest to one that is content with looking for differences in known nodes of an input document. In this case I pass-through everything other than a <story> node; when I encounter a <story> node I look for a matching story in the prior feed, and pass through if not found. "matching" means local-names are same, attributes are same in number, name and value, and all child nodes of <story> are present and similarly equal to those in the prior <story>, if any. I realize "match" may be application dependant.

I used an augmented vset:difference from "XSLT Cookbook" to determine <story> differences. MSXML 4x (?) from a .NET app took 6-something minutes for a 1200-node transform. Saxon v653, JDK 1.4, from the commandline took 4-something minutes. Then I used keys to "index" stories in the prior, compared-to document. Now Saxon 653 took 42 seconds for the transform. Then I modified vset:difference to "short-circuit" at the first difference; now Saxon 653 took 30 seconds for the transform. I can't think of other algorithmic improvements to make; if anybody else can, please post.

Now for a question: each <story> has, say, a @date attribute; multiple stories in 1 feed may have the same @date value. If I find a changed (or new) story, I want to pass through not just that story, but also all other stories with that same @date value, no matter they're unchanged.

My first thougth about accomplishing this is to perform some sort of pipeline processing (it's immaterial if it's done in-situ in the XSLT processor). The first pass would generate a diff result, and a second pass would add back all stories of the same [unique] dates as the ones in the diff result.

It would be nice to do this in one [conceptual] pass:
--for every story check if it should pass through
--if a passthrough also pass through otheries stories with the same @date

Of course, a naive approach would have stories output multiply, unless I somehow "remember" what [say, dates] has already passed through. I could do that in a procedural language, but don't know how to [efficiently] in a functional one such as XSLT. While I read up on functional programming techniques, if anybody can suggest a way to accomplish this, please post.

Here are 2 sample feeds:

feed1.xml
======
<feed>
 <story @date="1" text="a"/>
 <story @date="2" text="a"/>
</feed>


feed2.xml
======
<feed>
 <story @date="1" text="a"/>
 <story @date="1" text="b"/>
 <story @date="2" text="a"/>
</feed>

If feed2.xml were the "current" feed, and feed1.xml the propr one, I want to end up with the following, in 1 pass.

final.xml
=====
<feed>
 <story @date="1" text="a"/>
 <story @date="1" text="aa"/>
</feed>

Thank you

--A

_________________________________________________________________
Don?t just search. Find. Check out the new MSN Search! http://search.msn.click-url.com/go/onm00200636ave/direct/01/


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--