I'm banging my head against a sequence alignment problem. I have a
feeling that this is straightforward, but I can't put my finger on
what's missing from my attempts.
Suppose I have two inputs like so, where the words in input1//w always
occur, in the same order, among the words in input2//w:
<input1>
<w n="1">I</w>
<w n="2">am</w>
<w n="3">a</w>
<w n="4">sequence</w>
</input1>
<input2>
<w>I</w>
<w>am</w>
<w>a</w>
<w>longer</w>
<w>longer</w>
<w>sequence</w>
</input2>
I'd like to get output like so:
<output>
<w n="1">I</w>
<w n="2">am</w>
<w n="3">a</w>
<w n="skipped">longer</w>
<w n="skipped">longer</w>
<w n="4">sequence</w>
</output>
I.e., for each input1//w, @n should be copied to the first <w> in
input2 (in document order, after the previous match) whose string value
equals .; any <w> in input2 that has no counterpart in input1 should be
flagged as "skipped".
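To pin down the rule I have in mind, here is a minimal sketch in Python
rather than XSLT, purely as an illustration of the matching logic. The
names align, INPUT1 and INPUT2 are made up for this example, and the
sketch assumes that input1 really is an in-order subsequence of input2.
It walks input2 once and keeps a pointer to the next unmatched word of
input1:

```python
import xml.etree.ElementTree as ET

# Sample data copied from the inputs shown above.
INPUT1 = """<input1>
<w n="1">I</w><w n="2">am</w><w n="3">a</w><w n="4">sequence</w>
</input1>"""

INPUT2 = """<input2>
<w>I</w><w>am</w><w>a</w><w>longer</w><w>longer</w><w>sequence</w>
</input2>"""


def align(input1_xml, input2_xml):
    """Greedy one-pass alignment (hypothetical helper, not XSLT)."""
    src = ET.fromstring(input1_xml).findall("w")
    out = ET.Element("output")
    i = 0  # index of the next input1 word still waiting for a match
    for w in ET.fromstring(input2_xml).findall("w"):
        new = ET.SubElement(out, "w")
        new.text = w.text
        if i < len(src) and src[i].text == w.text:
            # Same string value: copy @n and advance in input1.
            new.set("n", src[i].get("n"))
            i += 1
        else:
            # No counterpart at this position in input1.
            new.set("n", "skipped")
    return out


if __name__ == "__main__":
    print(ET.tostring(align(INPUT1, INPUT2), encoding="unicode"))
```

Note that this greedy rule takes the first matching word it meets, so if
an extra word in input2 happens to equal the next expected word of
input1, the number lands on the earlier occurrence. With a genuinely
imperfect transcription, something more forgiving would probably be
needed.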
P.S.: The use case is aligning an imperfect but timestamped
transcription of an audio file (input1, machine-generated) with a
perfect but not-timestamped one (input2, human-generated).
Thanks much for any help,
Markus
--
Markus Flatscher, Project Editor
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville VA 22904, USA
Courier: 211 Emmet Street South, Charlottesville VA 22903, USA
Email: markus(_dot_)flatscher(_at_)virginia(_dot_)edu
Web: http://rotunda.upress.virginia.edu/