Dieter,
This is the topic of my paper this year at the Extreme conference, and of
others there. Multiple concurrent hierarchies is a hot problem. If you contact me
off-list, I can provide you with more info. Extreme is only three weeks
away, and putting its Proceedings together is one of the things I'm working
on when I'm not writing to this list. :->
Wednesday August 4 will be "Overlap Day" at Extreme this year: see the
program at http://www.extrememarkup.com/extreme/2004/wednesday.asp. You may
notice that no fewer than four of Wednesday's abstracts start with the same
string (is this a case of overlap?): "Overlap in markup occurs where some
markup structures do not nest...".
The short version of the story is that this is most easily done by handling
the markup quite differently from the way XSLT expects. It can be done
with XSLT fairly simply (that's what my paper is on), but it's highly
unorthodox. In your case, a simple approach would be to process the input
in two passes, one to flatten all the markup into milestones, the next to
write the flat stuff out again with the hierarchy you want.
But no guarantee even of well-formedness can be made about the output,
using current tools, which is one reason why this is an interesting
research area. We'd like to get to that point, but this will require
implementing LMNL (http://www.lmnl.net) or something similar.
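To make the two-pass idea above concrete, here is a minimal sketch. It is in Python rather than XSLT, and it is not the technique from my paper, only an illustration of "flatten, then rebuild". The function names flatten and rebuild are made up for this example. It assumes the input starts with a <pb>, that every <pb> carries an n attribute, and that a new line begins at each <lb>, <pb>, <p> or <div>. Because it builds a tree instead of writing tags out as strings, it sidesteps the well-formedness worry, which a milestone-based stylesheet cannot do.

```python
import xml.etree.ElementTree as ET

# The sample input from the original question.
SOURCE = """<doc>
<pb n="1" />
<div>
<p>Line A
<lb/>Line B
<pb n="2" />
<lb/>Line C
</p>
<p>Line D
<lb/>Line E
<lb/>Line F
</p>
<pb n="3" />
<p>Line G
<lb/>Line H
<lb/>Line I
</p>
</div>
<div>
<p>Line J
<lb/>Line K
<lb/>Line L
</p>
</div>
</doc>"""


def flatten(root):
    """Pass 1: reduce the tree to a flat list of (kind, value) events."""
    events = []

    def add_text(text):
        # Whitespace-only text between elements is dropped.
        if text and text.strip():
            events.append(("text", text.strip()))

    def walk(node):
        if node.tag == "pb":
            events.append(("pb", node.get("n")))
        elif node.tag == "lb":
            events.append(("lb", None))
        elif node.tag in ("div", "p"):
            # The old containers survive only as start milestones.
            events.append(("new" + node.tag, None))
        add_text(node.text)
        for child in node:
            walk(child)
            add_text(child.tail)

    walk(root)
    return events


def rebuild(events):
    """Pass 2: write the flat events back out under <page> and <line>."""
    doc = ET.Element("doc")
    page = line = None
    count = 0
    for kind, value in events:
        if kind == "pb":
            # A page break opens a new page and restarts line numbering.
            page = ET.SubElement(doc, "page", n=value)
            line, count = None, 0
            continue
        if page is None:
            # Assumption of this sketch: nothing precedes the first <pb>.
            raise ValueError("content before the first <pb>")
        if kind == "text":
            if line is None:
                count += 1
                line = ET.SubElement(page, "line", n=f"{page.get('n')}.{count}")
                line.text = value
            else:
                line.text += " " + value
        else:
            # lb, newdiv and newp all end the current line;
            # only newdiv and newp are written out as milestones.
            line = None
            if kind != "lb":
                ET.SubElement(page, kind)
    return doc


result = rebuild(flatten(ET.fromstring(SOURCE)))
print(ET.tostring(result, encoding="unicode"))
```

Run against the sample, this gives three <page> elements with lines numbered 1.1 through 3.6 and the <newdiv/> and <newp/> milestones in document order. The same two steps map onto two XSLT passes: one stylesheet that emits the flat sequence, and a second that groups it by page.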
Your data looks like near-TEI. The TEI folks (who, like the OSIS project,
have to deal with overlap more than a little) are watching this space. :->
Cheers,
Wendell
At 03:45 AM 7/14/2004, you wrote:
I have a source document which uses a hierarchy to mark up the structure
of the text of a manuscript (<div> for the big divisions and <p> for the
paragraphs) and milestone tags for page breaks (<pb>) and line breaks
(<lb>), which may occur in virtually any place inside the hierarchy, for
example:
<doc>
  <pb n="1" />
  <div>
    <p>Line A
      <lb/>Line B
      <pb n="2" />
      <lb/>Line C
    </p>
    <p>Line D
      <lb/>Line E
      <lb/>Line F
    </p>
    <pb n="3" />
    <p>Line G
      <lb/>Line H
      <lb/>Line I
    </p>
  </div>
  <div>
    <p>Line J
      <lb/>Line K
      <lb/>Line L
    </p>
  </div>
</doc>
I would like to transform this document into a nested structure of <page>
and <line> tags and mark up the textual divisions as milestones:
<doc>
  <page n="1">
    <newdiv/>
    <newp/>
    <line n="1.1">Line A</line>
    <line n="1.2">Line B</line>
  </page>
  <page n="2">
    <line n="2.1">Line C</line>
    <newp/>
    <line n="2.2">Line D</line>
    <line n="2.3">Line E</line>
    <line n="2.4">Line F</line>
  </page>
  <page n="3">
    <newp/>
    <line n="3.1">Line G</line>
    <line n="3.2">Line H</line>
    <line n="3.3">Line I</line>
    <newdiv/>
    <newp/>
    <line n="3.4">Line J</line>
    <line n="3.5">Line K</line>
    <line n="3.6">Line L</line>
  </page>
</doc>
What is the best strategy to do this? (My main problem is to get a
selection of nodes spanning between <pb> tags appearing on different
levels in the hierarchy.)
======================================================================
Wendell Piez
mailto:wapiez(_at_)mulberrytech(_dot_)com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================