[xsl] Aligning/merging two sequences

2010-09-30 11:51:14
I'm banging my head against a sequence alignment problem. I have a feeling that this is straightforward, but I can't put my finger on what's missing from my attempts.

Suppose I have two inputs like so, where the words in input1//w always occur, in the same order, among the words in input2//w:

<input1>
 <w n="1">I</w>
 <w n="2">am</w>
 <w n="3">a</w>
 <w n="4">sequence</w>
</input1>

<input2>
 <w>I</w>
 <w>am</w>
 <w>a</w>
 <w>longer</w>
 <w>longer</w>
 <w>sequence</w>
</input2>

I'd like to get output like so:

<output>
 <w n="1">I</w>
 <w n="2">am</w>
 <w n="3">a</w>
 <w n="skipped">longer</w>
 <w n="skipped">longer</w>
 <w n="4">sequence</w>
</output>

I.e., for each input1//w, @n should be copied to the nearest following <w> in input2 that has the same string value; any <w> in input2 that has no counterpart in input1 should be flagged with n="skipped".
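
For what it's worth, the rule above amounts to a single greedy left-to-right pass over input2, which might be sketched in XSLT 2.0 roughly as follows. This is only a sketch under assumptions: the <inputs> wrapper element and the template name "align" are made up for illustration, and it presumes that the words of input1 really do appear in input2 in the same order.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output indent="yes"/>

  <!-- Assumption: input1 and input2 are wrapped in one hypothetical
       <inputs> element so that both are reachable from a single source. -->
  <xsl:template match="/inputs">
    <output>
      <xsl:call-template name="align">
        <xsl:with-param name="short" select="input1/w"/>
        <xsl:with-param name="long" select="input2/w"/>
      </xsl:call-template>
    </output>
  </xsl:template>

  <!-- Walk the longer sequence one word at a time. $short holds the
       input1 words that have not been matched yet. -->
  <xsl:template name="align">
    <xsl:param name="short" as="element(w)*"/>
    <xsl:param name="long" as="element(w)*"/>
    <xsl:if test="exists($long)">
      <!-- True when the current input2 word equals the next unmatched
           input1 word. -->
      <xsl:variable name="hit"
          select="exists($short) and string($short[1]) = string($long[1])"/>
      <!-- Copy @n on a match, otherwise flag the word as skipped. -->
      <w n="{if ($hit) then $short[1]/@n else 'skipped'}">
        <xsl:value-of select="$long[1]"/>
      </w>
      <!-- Always advance in $long; advance in $short only after a match. -->
      <xsl:call-template name="align">
        <xsl:with-param name="short"
            select="if ($hit) then subsequence($short, 2) else $short"/>
        <xsl:with-param name="long" select="subsequence($long, 2)"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

A pass like this only looks at the next unmatched input1 word, so if input1 contains a word that never shows up in input2 (a misrecognized word, say), nothing after that point would match and everything else would come out as "skipped". Handling that case would need some kind of look-ahead or a proper diff-style alignment instead.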

P.S.: The use case is aligning an imperfect but timestamped transcription of an audio file (input1, machine-generated) with a perfect but not-timestamped one (input2, human-generated).

Thanks much for any help,

Markus

--
Markus Flatscher, Project Editor
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville VA 22904, USA
Courier: 211 Emmet Street South, Charlottesville VA 22903, USA
Email: markus(_dot_)flatscher(_at_)virginia(_dot_)edu
Web: http://rotunda.upress.virginia.edu/

