xsl-list

Re: Transforming milestone tags

2004-07-14 09:01:49
Dieter,

This is the topic of my paper this year at the Extreme conference, and of others' papers as well. Multiple concurrent hierarchies are a hot problem. If you contact me off-list, I can provide you with more info. Extreme is only three weeks away, and putting its Proceedings together is one of the things I'm working on when I'm not writing to this list. :->

Wednesday, August 4 will be "Overlap Day" at Extreme this year: see the program at http://www.extrememarkup.com/extreme/2004/wednesday.asp. You may notice that no fewer than four of Wednesday's abstracts start with the same string (is this a case of overlap?): "Overlap in markup occurs where some markup structures do not nest...".

The short version of the story is that this is most easily done by handling the markup quite differently from the way XSLT expects it to be handled. It can be done with XSLT fairly simply (that's what my paper is on), but it's highly unorthodox. In your case, a simple approach would be to process the input in two passes: the first flattens all the markup into milestones, and the second writes the flat content out again with the hierarchy you want.
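
A minimal sketch of the two-pass idea, written in Python rather than XSLT just to keep it short. It is not the technique from the paper; it only illustrates "flatten, then rebuild" on the sample input quoted below. The function names (flatten, rebuild) and the event tuples are made up for this illustration, and the sketch assumes the document begins with a <pb> and that each run of text between milestones is one line.

```python
import xml.etree.ElementTree as ET

# Sample input copied from the question quoted below.
SOURCE = """<doc>
  <pb n="1" />
  <div>
    <p>Line A
    <lb/>Line B
    <pb n="2" />
    <lb/>Line C
    </p>
    <p>Line D
    <lb/>Line E
    <lb/>Line F
    </p>
    <pb n="3" />
    <p>Line G
    <lb/>Line H
    <lb/>Line I
    </p>
  </div>
  <div>
    <p>Line J
    <lb/>Line K
    <lb/>Line L
    </p>
  </div>
</doc>"""


def flatten(elem):
    """Pass 1: yield a flat stream of (kind, value) events in document order."""
    if elem.tag == "pb":
        yield ("pb", elem.get("n"))
    elif elem.tag == "lb":
        yield ("lb", None)
    elif elem.tag in ("div", "p"):
        # The old containers become empty milestones: newdiv / newp.
        yield ("new" + elem.tag, None)
    if elem.text and elem.text.strip():
        yield ("text", elem.text.strip())
    for child in elem:
        yield from flatten(child)
        if child.tail and child.tail.strip():
            yield ("text", child.tail.strip())


def rebuild(events):
    """Pass 2: wrap the flat stream in <page> and <line> elements.

    Assumes the first event is a "pb"; text or milestones before any
    page break would have no page to go into.
    """
    doc = ET.Element("doc")
    page, line_no, buf = None, 0, []
    for kind, value in list(events) + [("end", None)]:
        if kind == "text":
            buf.append(value)
            continue
        # Any non-text event closes the line collected so far.
        if buf:
            line_no += 1
            line = ET.SubElement(page, "line", n=f"{page.get('n')}.{line_no}")
            line.text = " ".join(buf)
            buf = []
        if kind == "pb":
            page = ET.SubElement(doc, "page", n=value)
            line_no = 0  # line numbers restart on every page
        elif kind in ("newdiv", "newp"):
            ET.SubElement(page, kind)
    return doc


result = rebuild(flatten(ET.fromstring(SOURCE)))
print(ET.tostring(result, encoding="unicode"))
```

Because this sketch builds the result as a tree instead of writing start and end tags as text, its output is well-formed by construction; that is a property of the sketch, not of the tag-writing approach in XSLT.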

But no guarantee even of well-formedness can be made about the output, using current tools, which is one reason why this is an interesting research area. We'd like to get to that point, but this will require implementing LMNL (http://www.lmnl.net) or something similar.

Your data looks like near-TEI. The TEI folks (who, like the OSIS project, have to deal with overlap more than a little) are watching this space. :->

Cheers,
Wendell

At 03:45 AM 7/14/2004, you wrote:
I have a source document which uses a hierarchical structure to mark up the text of a manuscript (<div> for the big divisions and <p> for the paragraphs) and milestone tags for page breaks (<pb>) and line breaks (<lb>), which may occur in virtually any place inside the hierarchy, for example:

<doc>
  <pb n="1" />
  <div>
    <p>Line A
    <lb/>Line B
    <pb n="2" />
    <lb/>Line C
    </p>
    <p>Line D
    <lb/>Line E
    <lb/>Line F
    </p>
    <pb n="3" />
    <p>Line G
    <lb/>Line H
    <lb/>Line I
    </p>
  </div>
  <div>
    <p>Line J
    <lb/>Line K
    <lb/>Line L
    </p>
  </div>
</doc>

I would like to transform this document into a nested structure of <page> and <line> tags and mark up the textual divisions as milestones:

<doc>
  <page n="1">
    <newdiv/>
    <newp/>
    <line n="1.1">Line A</line>
    <line n="1.2">Line B</line>
  </page>
  <page n="2">
    <line n="2.1">Line C</line>
    <newp/>
    <line n="2.2">Line D</line>
    <line n="2.3">Line E</line>
    <line n="2.4">Line F</line>
  </page>
  <page n="3">
    <newp/>
    <line n="3.1">Line G</line>
    <line n="3.2">Line H</line>
    <line n="3.3">Line I</line>
    <newdiv/>
    <newp/>
    <line n="3.4">Line J</line>
    <line n="3.5">Line K</line>
    <line n="3.6">Line L</line>
  </page>
</doc>

What is the best strategy to do this? (My main problem is to get a selection of nodes spanning between <pb> tags appearing on different levels in the hierarchy.)


======================================================================
Wendell Piez                            
mailto:wapiez(_at_)mulberrytech(_dot_)com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================


