Here is a sample of the citation abbreviations that need to be matched, taken
from the lookup table used by the transforms. For simplicity, <i> indicates
italics, and the expansions that the lookup table also contains are omitted:
<abbr xml:id="st001"><i>Cal. Franklin Papers</i>, A.P.S.</abbr>
<abbr xml:id="st002">Jay, <i>Unpublished Papers</i></abbr>
<abbr xml:id="st003"><i>JCC</i></abbr>
<abbr xml:id="st004"><i>Oxford Classical Dicy.</i></abbr>
<abbr xml:id="st005">U.S. Census, 1790</abbr>
In the incoming XML, abbreviations like those above appear in running text
without wrapper elements. The automated process to add wrappers needs to
operate on string values that often cross <i> boundaries, as in the first two
examples. So one might find in running text:
<note>See for example Jay, <i>Unpublished Papers</i>, 4:123.</note>
which needs to be transformed into
<note>See for example <ref target="st002">Jay, <i>Unpublished
Papers</i></ref>, 4:123.</note>
The XPath //note[matches(., 'Jay, Unpublished Papers')] will match the input
<note>, because the string value of the element ignores the <i> tags. The hard
part is writing a template that wraps the appropriate portions of the note in
a <ref> element. That's why our preprocessing converts <i> tags in both input
and lookup table to single text characters: it makes the string matching
relatively simple.
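
A minimal sketch of that marker substitution, written in Python rather than
XSLT purely for illustration. The marker characters (U+E000 and U+E001), the
function names, and the string-level handling of the note content are all
assumptions made for this example, not the actual transform. It also assumes
the content contains no inline markup other than <i> and no marker characters
of its own.

```python
import re

# Hypothetical private-use characters standing in for <i> and </i>.
OPEN, CLOSE = "\uE000", "\uE001"

# Abbreviations copied from the sample lookup table, keyed by xml:id.
LOOKUP = {
    "st001": "<i>Cal. Franklin Papers</i>, A.P.S.",
    "st002": "Jay, <i>Unpublished Papers</i>",
    "st003": "<i>JCC</i>",
    "st004": "<i>Oxford Classical Dicy.</i>",
    "st005": "U.S. Census, 1790",
}


def encode(markup):
    """Replace <i> tags with single marker characters."""
    return markup.replace("<i>", OPEN).replace("</i>", CLOSE)


def decode(text):
    """Turn marker characters back into <i> tags."""
    return text.replace(OPEN, "<i>").replace(CLOSE, "</i>")


# Encode the lookup table the same way as the input, so that both sides
# of the comparison are plain strings.
ENCODED = {encode(abbr): ref_id for ref_id, abbr in LOOKUP.items()}

# Try longer abbreviations first so a short one cannot pre-empt a long one.
PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(ENCODED, key=len, reverse=True))
)


def wrap_refs(content):
    """Wrap every known abbreviation in the note content in a <ref>."""
    flat = encode(content)
    wrapped = PATTERN.sub(
        lambda m: f'<ref target="{ENCODED[m.group(0)]}">{m.group(0)}</ref>',
        flat,
    )
    return decode(wrapped)


print(wrap_refs("See for example Jay, <i>Unpublished Papers</i>, 4:123."))
```

Because the lookup entries and the running text are encoded identically, a
match that starts outside the italics and ends inside them is just an ordinary
substring match.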
And we do in fact use unusual Unicode characters for markers in our current
transform; the example I gave substituted markers that would show up in
everyone's email.
David
On Apr 8, 2013, at 2:58 PM, Michael Müller-Hillebrand
<mmh(_at_)docufy(_dot_)de> wrote:
David,
Can you give a more complex example showing how "variable in structure" those
citations may be? This may also shed some light on the kind of processing you
want to do. Changing tags to characters (why are you using ASCII instead of
some high Unicode character from the private use area?) and then back to tags
does not seem a very interesting thing…
- Michael
On 08.04.2013 at 20:15, David Sewell <dsewell(_at_)virginia(_dot_)edu> wrote:
I expect this has been discussed here before, but I can't locate any relevant
discussion, so here goes.
We have input data with many unmarked short-title citations that look like
this:
Sprague, <hi rend="italic">Braintree Families</hi>
We want to wrap them inside another element, in our case a <ref> to the
bibliographic expansion. We have a venerable chain of XSLT 2.0 transforms that
does this, and pretty well, by preprocessing the data to convert all those <hi>
tags into a pair of unique ASCII characters, so that we can do string-matching
operations within a single text node that now includes something like
Sprague, ¢Braintree Families¥
which is easy to handle with xsl:analyze-string. Then once we've wrapped all
the strings we need to, we post-process with xsl:analyze-string to put the <hi>
elements back in.
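
The flatten-and-restore round trip described above can be sketched as follows.
This is an illustration in Python with the standard-library ElementTree, not
the XSLT chain itself; the marker characters and the flatten/restore names are
hypothetical, and the sketch assumes <hi rend="italic"> is the only inline
element present. The wrapping step that would run between the two functions is
left out.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical marker characters standing in for <hi rend="italic"> and </hi>.
OPEN, CLOSE = "\uE000", "\uE001"


def flatten(elem):
    """Return the mixed content of elem as one string, with markers for <hi>."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "hi" and child.get("rend") == "italic":
            parts.append(OPEN + flatten(child) + CLOSE)
        else:
            # Assumption of this sketch: no other inline elements occur.
            raise ValueError(f"unexpected inline element: {child.tag}")
        parts.append(child.tail or "")
    return "".join(parts)


def restore(flat, tag):
    """Rebuild an element named tag from a flattened, marker-bearing string."""
    # Escape &, < and > first so that only the markers become markup.
    xml = escape(flat)
    xml = xml.replace(OPEN, '<hi rend="italic">').replace(CLOSE, "</hi>")
    return ET.fromstring(f"<{tag}>{xml}</{tag}>")


source = '<note>Sprague, <hi rend="italic">Braintree Families</hi>, 12.</note>'
flat = flatten(ET.fromstring(source))
rebuilt = ET.tostring(restore(flat, "note"), encoding="unicode")
print(rebuilt == source)
```

Between the two calls, the flattened string is a single run of text, which is
what lets regex-based matching see a citation that straddles an italic
boundary.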
In practice, given the proper regexes, this works quite well and provides the
desired output, but I always feel a bit guilty about the hackishness of the
approach. Given that the citations are quite variable in structure (usually but
not always containing <hi> elements, with various combinations of text nodes at
start and end), I've never come up with a good general-purpose way to operate
purely on elements and text nodes without the convert-tags-to-characters step.
Is there one (or more)?
David S.
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail:
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--