For those interested in this thread,
here is how I resolved this.
It works like a charm on all tests,
and I am pleased with the robustness.
Let me first thank all who stepped in.
I got some inspiration from the different posts;
your contributions are highly appreciated.
Since the problem is contained in paragraphs, and I can quickly check
per paragraph whether I have to bother with a revision at all,
making multiple passes through the data does not really slow me down
(too much).
The thinking was the hard work. The actual XSLT implementation was
not too bad once the algorithm was solid.
Let me show you what I did (simplified), taking test 7 as an example:
<in original="this old foo is breaking" revision="a new bar
is building" >
<p><b type="stronger">I <i>did not realize that this
</i></b>old foo is breaking <i>this old foo</i></p>
</in>
Pass 1. Take out the structure by turning each element tag into an
empty element marker (with an id),
and at the same time put offset markers at every location where a
matching pattern could start or end
(if "t" is the first character of @original, put a marker in front
of every "t";
if "g" is the last character of @original, place a marker after every "g").
The markers are potential start <ps/> and potential end <pe/>.
This results in (simplified; I have namespaces, maintain attributes, et al.):
<p><start name="b" id="A"/>I <start name="i" id="B"/>did
not realize <ps/>that <ps/>this <end name="i" id="B"/><end name="b"
id="A"/>old foo is breaking<pe/> <start name="i" id="C"/><ps/>this
old foo<end name="i" id="C"/></p>
With that, the hard work is actually done.
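
As an illustration, Pass 1 could be sketched in XSLT 2.0 roughly as
follows. This is a simplified sketch, not the actual stylesheet: it
assumes the <in> wrapper from the example is the document element, it
ignores namespaces and drops the original attributes, and the ids come
from generate-id() rather than the readable A, B, C used above. It also
marks every "t" and every "g", so it produces more <ps/> markers than
the abbreviated sample output shows.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: @original sits on the <in> document element -->
  <xsl:variable name="original" select="normalize-space(/in/@original)"/>
  <xsl:variable name="first" select="substring($original, 1, 1)"/>
  <xsl:variable name="last"
      select="substring($original, string-length($original), 1)"/>

  <!-- Keep the wrapper and its attributes for the later passes -->
  <xsl:template match="in">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <!-- The paragraph stays a real element -->
  <xsl:template match="p">
    <xsl:copy>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <!-- Every other element becomes a pair of empty markers sharing one id -->
  <xsl:template match="*">
    <start name="{name()}" id="{generate-id()}"/>
    <xsl:apply-templates/>
    <end name="{name()}" id="{generate-id()}"/>
  </xsl:template>

  <!-- Walk the text one character at a time and drop in offset markers -->
  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="." flags="s">
      <xsl:matching-substring>
        <xsl:if test=". = $first"><ps/></xsl:if>
        <xsl:value-of select="."/>
        <xsl:if test=". = $last"><pe/></xsl:if>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Matching single characters keeps the regular expression trivial; the
real filtering is left to Pass 2.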
Pass 2.
On each <ps/>, check whether the join of all following text nodes
(normalized one way or another) starts with the normalized @original;
if so, upgrade it to a revision start <rs/>.
On each <pe/>, check whether the join of all preceding text nodes
(normalized one way or another) ends with the normalized @original;
if so, upgrade it to a revision end <re/>.
Markers that do not match are dropped.
This results in:
<p><start name="b" id="A"/>I <start name="i" id="B"/>did
not realize that <rs/>this <end name="i" id="B"/><end name="b"
id="A"/>old foo is breaking<re/> <start name="i" id="C"/>this old
foo<end name="i" id="C"/></p>
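
A minimal XSLT 2.0 sketch of the Pass 2 check might look like this. It
is an illustration under assumptions, not the actual stylesheet: it uses
normalize-space() as the "one way or another" normalization, limits the
joined text to the enclosing <p>, and again expects @original on the
<in> document element.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: @original sits on the <in> document element -->
  <xsl:variable name="original" select="normalize-space(/in/@original)"/>

  <!-- Identity template: copy everything that is not a marker -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Potential start: keep it only if the text after it starts with
       the pattern -->
  <xsl:template match="ps">
    <xsl:variable name="tail" select="normalize-space(string-join(
        ancestor::p[1]//text()[. >> current()], ''))"/>
    <xsl:if test="starts-with($tail, $original)">
      <rs/>
    </xsl:if>
  </xsl:template>

  <!-- Potential end: keep it only if the text before it ends with
       the pattern -->
  <xsl:template match="pe">
    <xsl:variable name="head" select="normalize-space(string-join(
        ancestor::p[1]//text()[. &lt;&lt; current()], ''))"/>
    <xsl:if test="ends-with($head, $original)">
      <re/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

Because the element tags are already reduced to empty markers, the text
nodes of the paragraph can simply be joined in document order, which is
what makes plain starts-with() and ends-with() sufficient here.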
Pass 3.
Structure the revisions by turning each <rs/> ... <re/> pair into a
real <rev> element.
This results in:
<p><start name="b" id="A"/>I <start name="i" id="B"/>did
not realize that <rev>this <end name="i" id="B"/><end name="b"
id="A"/>old foo is breaking</rev> <start name="i" id="C"/>this old
foo<end name="i" id="C"/></p>
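
Pass 3 could be sketched in XSLT 2.0 along these lines. The sketch
assumes at most one revision per paragraph and that <rs/> and <re/> are
direct children of <p> (which holds once Pass 1 has flattened the
structure); handling several revisions in one paragraph would need
grouping or recursion on top of this.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template for everything else -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Only paragraphs where the first <rs/> is followed by an <re/> -->
  <xsl:template match="p[rs[1]/following-sibling::re]">
    <xsl:variable name="rs" select="rs[1]"/>
    <xsl:variable name="re" select="$rs/following-sibling::re[1]"/>
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- Everything before the revision start -->
      <xsl:copy-of select="node()[. &lt;&lt; $rs]"/>
      <!-- Everything between the two markers becomes the revision -->
      <rev>
        <xsl:copy-of select="node()[. >> $rs and . &lt;&lt; $re]"/>
      </rev>
      <!-- Everything after the revision end -->
      <xsl:copy-of select="node()[. >> $re]"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```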
Pass 4.
Move the end tag markers inside a revision whose corresponding start
tag marker lies outside it (hence the id) out of the revision, to
right before the revision.
Do something similar with the start tag markers.
This results in:
<p><start name="b" id="A"/>I <start name="i" id="B"/>did
not realize that <end name="i" id="B"/><end name="b"
id="A"/><rev>this old foo is breaking</rev> <start name="i"
id="C"/>this old foo<end name="i" id="C"/></p>
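
One way Pass 4 might be written in XSLT 2.0 is sketched below. It reads
"something similar" as: start tag markers whose end marker lies outside
the revision are moved to right after the revision. That reading is an
assumption, as is leaving marker pairs that sit completely inside the
revision where they are.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template for everything else -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="rev">
    <!-- End markers whose start marker is not inside this revision -->
    <xsl:variable name="ends" select="end[not(@id = ../start/@id)]"/>
    <!-- Start markers whose end marker is not inside this revision -->
    <xsl:variable name="starts" select="start[not(@id = ../end/@id)]"/>

    <!-- Dangling end markers go right before the revision -->
    <xsl:copy-of select="$ends"/>
    <rev>
      <xsl:copy-of select="node() except ($ends, $starts)"/>
    </rev>
    <!-- Dangling start markers go right after the revision (assumption) -->
    <xsl:copy-of select="$starts"/>
  </xsl:template>

</xsl:stylesheet>
```

The moved markers keep their relative order, so the nesting of the
surrounding elements stays intact.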
Pass 5.
Clean up: make the actual replacement in the revision and turn the
markers into elements again. This results in:
<p><b>I <i>did not realize that </i></b><rev>a new bar is
building</rev> <i>this old foo</i></p>
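
For illustration, Pass 5 could be done in XSLT 2.0 with a small
recursive named template. The template name "rebuild" is made up for
this sketch, and it assumes that every <start/> has a matching <end/>
with the same id among its following siblings, that @revision sits on
the <in> document element, and that namespaces and original attributes
can be ignored.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template for the wrapper and anything not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Rebuild the element structure inside each paragraph -->
  <xsl:template match="p">
    <xsl:copy>
      <xsl:call-template name="rebuild">
        <xsl:with-param name="nodes" select="node()"/>
      </xsl:call-template>
    </xsl:copy>
  </xsl:template>

  <!-- The actual replacement: the revision text takes the place of the
       matched original text -->
  <xsl:template match="rev">
    <rev><xsl:value-of select="/in/@revision"/></rev>
  </xsl:template>

  <!-- Hypothetical helper: turns a flat run of nodes with start/end
       markers back into nested elements -->
  <xsl:template name="rebuild">
    <xsl:param name="nodes" as="node()*"/>
    <xsl:variable name="s" select="$nodes[self::start][1]"/>
    <xsl:choose>
      <xsl:when test="$s">
        <xsl:variable name="e"
            select="$nodes[self::end][@id = $s/@id][1]"/>
        <!-- Nodes before the first start marker are output as they are -->
        <xsl:apply-templates select="$nodes[. &lt;&lt; $s]"/>
        <!-- Nodes between the marker pair become the element content -->
        <xsl:element name="{$s/@name}">
          <xsl:call-template name="rebuild">
            <xsl:with-param name="nodes"
                select="$nodes[. >> $s and . &lt;&lt; $e]"/>
          </xsl:call-template>
        </xsl:element>
        <!-- Continue with whatever follows the end marker -->
        <xsl:call-template name="rebuild">
          <xsl:with-param name="nodes" select="$nodes[. >> $e]"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates select="$nodes"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

This only works because Pass 4 leaves the markers properly paired and
nested outside the revision; a start marker without its end marker
would lose the nodes that follow it.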
The turning point for me was adding the offset markers.
Before that I was auto-generating pretty complex regular expressions;
now I get away with a simple starts-with() and ends-with().
If anyone sees a possible improvement here or there, please let me know.
Me happy now, thanks for your help
Geert