Re: [xsl] problem with transforming mixed content
2020-08-15 11:26:59
A generic type (a) approach is described in my XML Prague 2019 paper
"Splitting XML Documents at Milestone Elements Using the XSLT Upward
Projection Method"
https://www.xmlprague.cz/day3-2019/#projection
The solution consists of three passes in distinct modes:
- preprocess: Turn the colon into a <sep/> element, remove the original
namespace
- main: Use the upward projection splitter on title, splitting at sep,
name the resulting chunks 'title' or 'subtitle' (the result is a
split:chunks document with split:chunk children that have a name attribute)
- postprocess: unwrap the resulting chunks and rename their top-level
element (which is title, as in the preprocessed document) to the chunk names
Both files, the custom split-title.xsl and the split.xsl package, are
here: https://gist.github.com/gimsieke/24acca99f88ff232ed453d9177700472
When invoking Saxon (at least 9.8 I think), you need to give
-lib:split.xsl as a command line argument.
More info is in the paper (or in the slides).
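For readers who only want to see the shape of the three passes without the split.xsl package, here is a much simplified sketch. It is a stand-in under stated assumptions, not the upward projection method from the paper: it uses plain xsl:for-each-group, it assumes that the colon sits in an immediate child text node of a title element in no namespace, and it folds the lower-casing from the original question into the postprocess mode.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- preprocess: identity copy ... -->
  <xsl:template match="@* | node()" mode="preprocess">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" mode="preprocess"/>
    </xsl:copy>
  </xsl:template>

  <!-- ... except that the first colon in each text child of title
       becomes a <sep/> element (assumption: no namespace to remove) -->
  <xsl:template match="title/text()[contains(., ':')]" mode="preprocess">
    <xsl:value-of select="substring-before(., ':')"/>
    <sep/>
    <xsl:value-of select="replace(substring-after(., ':'), '^\s+', '')"/>
  </xsl:template>

  <!-- main: split the children of the preprocessed title at sep;
       plain grouping replaces the upward projection splitter here -->
  <xsl:template match="title">
    <xsl:variable name="marked">
      <xsl:apply-templates select="." mode="preprocess"/>
    </xsl:variable>
    <xsl:for-each-group select="$marked/title/node()"
                        group-starting-with="sep">
      <xsl:element name="{if (position() = 1) then 'title' else 'subtitle'}">
        <xsl:apply-templates select="current-group()[not(self::sep)]"
                             mode="postprocess"/>
      </xsl:element>
    </xsl:for-each-group>
  </xsl:template>

  <!-- postprocess: keep inline markup, lower-case all text -->
  <xsl:template match="*" mode="postprocess">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates mode="postprocess"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" mode="postprocess">
    <xsl:value-of select="lower-case(.)"/>
  </xsl:template>

</xsl:stylesheet>
```

Because the grouping only looks at direct children of title, a colon nested inside <i> is left alone, and several text children with colons yield one title followed by several subtitle elements. Handling separators at arbitrary depth is where the generic splitter earns its keep.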
Gerrit
On 15.08.2020 18:03, Wolfhart Totschnig wolfhart(_dot_)totschnig(_at_)mail(_dot_)udp(_dot_)cl
wrote:
Thank you, Martin, for the explanation of the unwanted namespace
declaration. I now know how to get rid of it.
And thank you, Michael, for the detailed explanation of possible
approaches to the problem. Graydon's solution will work very well in my
case, I think, since I can test for most error-producing conditions
before applying the code and the probability of further errors seems
sufficiently low in my context and for my purposes. But I am still
curious: What would an approach of type (a) look like in my case? It
seems to me that implementing this approach would again face the
original problem: "turning the punctuation into markup" sounds like a
description of the original problem.
Best,
Wolfhart
On 15.08.20 05:16, Michael Kay mike(_at_)saxonica(_dot_)com wrote:
This problem comes up from time to time, and it's not easy.
There seem to be three general approaches:
(a) turn the punctuation into markup (e.g. turn ":" into <colon/>),
then do the manipulation on a tree of nodes
(b) turn the markup into punctuation, then do the manipulation on the
resulting text.
(c) do it all in one pass
I see that Graydon's solution uses serialize() and parse-xml(), so
that's a modern approach to doing (b); while Dimitre's solution does
(c). In general I think the one-pass solution is often more
complicated and runs the risk of not being extensible when the problem
"evolves".
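To make approach (b) concrete, here is a minimal sketch that assumes an XSLT 3.0 processor. It is not Graydon's actual stylesheet, only an illustration of the serialize() and parse-xml-fragment() round trip, and it assumes every title contains exactly one colon followed by whitespace.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <xsl:template match="title">
    <!-- markup becomes text; lower-casing the whole string also
         lower-cases the tag names inside it -->
    <xsl:variable name="flat" select="lower-case(serialize(.))"/>
    <!-- close the title and open a subtitle at the colon -->
    <xsl:variable name="split"
        select="replace($flat, ':\s+', '&lt;/title&gt;&lt;subtitle&gt;')"/>
    <!-- the final end tag now has to close the subtitle instead -->
    <xsl:variable name="fixed"
        select="replace($split, '&lt;/title&gt;$', '&lt;/subtitle&gt;')"/>
    <!-- text becomes markup again -->
    <xsl:sequence select="parse-xml-fragment($fixed)"/>
  </xsl:template>

</xsl:stylesheet>
```

Everything between serialize() and parse-xml-fragment() is plain string surgery, so the stylesheet has no idea whether a given colon or capital letter belongs to text, a tag name, or a comment. A title without a colon produces a string that is not well-formed, and parse-xml-fragment() then raises an error.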
One of the things that can cause the problem to "evolve" is error
handling: dealing with situations where the input isn't quite as
simple as in your example. For example, multiple colons, no colons,
colons that are there for a different purpose, etc. You haven't
included any such cases in your requirements statement.
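Some of these conditions can be tested for before any splitting is attempted. The following is a sketch of such a check, assuming XSLT 2.0 and title elements in no namespace; the wording of the report is made up for illustration. It lists every title that does not have exactly one colon in its immediate text children, or that has additional colons nested in inline markup.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:apply-templates select="//title"/>
  </xsl:template>

  <xsl:template match="title">
    <!-- colons in immediate child text nodes -->
    <xsl:variable name="direct"
        select="sum(for $t in text()
                    return string-length($t)
                           - string-length(translate($t, ':', '')))"/>
    <!-- colons anywhere in the string value, including inside <i> -->
    <xsl:variable name="all"
        select="string-length(.) - string-length(translate(., ':', ''))"/>
    <xsl:if test="$direct ne 1 or $all ne $direct">
      <xsl:value-of
          select="concat('check: ', $direct, ' colon(s) at top level, ',
                         $all - $direct, ' nested, in: ',
                         normalize-space(.), '&#10;')"/>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

A check like this can count colons but cannot tell whether a single colon is there for a different purpose; that still needs knowledge of the data.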
If we ignore error handling, this example of the problem is simpler
than some because the ":" is always going to be in an immediate child
text node; we've seen other examples (like splitting a table) where we
need to look for conditions much deeper in the structure. This is
probably what makes a one-pass solution feasible in this case.
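As an illustration of why one pass is enough here, the following sketch (assuming XSLT 2.0; it is not Dimitre's solution) picks the first immediate child text node that contains a colon and treats it as the pivot. The variable name pivot is chosen for illustration only.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- titles that have a colon in an immediate child text node -->
  <xsl:template match="title[text()[contains(., ':')]]">
    <xsl:variable name="pivot" select="text()[contains(., ':')][1]"/>
    <title>
      <xsl:apply-templates select="$pivot/preceding-sibling::node()"/>
      <xsl:value-of select="lower-case(substring-before($pivot, ':'))"/>
    </title>
    <subtitle>
      <xsl:value-of
          select="lower-case(replace(substring-after($pivot, ':'), '^\s+', ''))"/>
      <xsl:apply-templates select="$pivot/following-sibling::node()"/>
    </subtitle>
  </xsl:template>

  <!-- everything else: copy elements, lower-case text -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:value-of select="lower-case(.)"/>
  </xsl:template>

</xsl:stylesheet>
```

The sibling axes only work because the pivot is a direct child of title. If the colon could appear inside <i>, the elements around it would have to be cut in two as well, and the template would grow accordingly.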
Intuitively, my feeling is that (a) is the most rigorous approach, the
one that is least likely to fail because of unanticipated input
conditions. For example, Graydon's solution fails if the input
contains tags with upper-case names, or if it contains comments with a
colon in the text.
Michael Kay
Saxonica
On 15 Aug 2020, at 03:16, Wolfhart Totschnig
wolfhart(_dot_)totschnig(_at_)mail(_dot_)udp(_dot_)cl wrote:
Dear list,
I would like to ask for your help with the following mixed-content
problem. I am receiving, from an external source, data in the
following form:
<title>THE TITLE OF THE BOOK WITH SOME <i>ITALICS</i> AND SOME MORE
WORDS: THE SUBTITLE OF THE BOOK WITH SOME <i>ITALICS</i></title>
What I would like to do is
1) separate the title from the subtitle (i.e., divide the data at the
colon) and put each in a separate element node;
2) all the while maintaining the <i> markup;
3) and perform certain string manipulations on all of the text nodes;
for the purposes of this post, I will use the example of converting
upper-case to lower-case.
So the desired output is the following:
<title>the title of the book with some <i>italics</i> and some more
words</title>
<subtitle>the subtitle of the book with some <i>italics</i></subtitle>
How can this be done?
I know that I can perform string manipulations while maintaining the
<i> markup with templates, i.e., <xsl:template match="text()"/> and
<xsl:template match="i"/>. But in this case I do not know how to
divide the data at the colon. And I know that I can divide the data
at the colon with <xsl:value-of select="substring-before(.,': ')"/>,
but then I lose the <i> markup. So I am at a loss.
Thanks in advance for your help!
Wolfhart