Ciarán,
On 1/17/2011 5:57 PM, Michael Kay wrote:
> This is a tough problem for three reasons:
>
> (a) upconversion is intrinsically hard, because it depends on
> recognizing the patterns that occur in the source - it's very much a
> heuristic rather than algorithmic process
>
> (b) the XML that you get out of MS-Word is not the easiest thing to
> start from, to put it mildly
>
> (c) for the above two reasons, you'll need to use every trick in the
> XSLT book, but you lack XSLT experience.
>
> The best advice I can give for this kind of task is to build it as a
> pipeline of transformations each of which gets you one step closer to
> the target....
I'd hope to have said exactly the same thing as Mike says here, though
not so concisely. As it is, I can start by underlining it. Especially
point (c). (And Jeroen's and Dan's posts are also good advice.)
A little background may also be helpful in developing your strategy:
For practical purposes this task is more or less impossible in
unassisted XSLT 1.0, but it becomes feasible with features of XSLT 2.0
including for-each-group, regular-expression string processing,
stylesheet functions, and temporary trees (to name only the most
important). None of these is part of XSLT 1.0, and none of them is a
beginner-level topic.
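To show the shape of such a pipeline without asking you to read XSLT 2.0
first, here is a minimal sketch in Python (standard library only). It is
not XSLT and not anyone's actual stylesheet: the input document, the
style names "Headword" and "Sense", and the function names are all
hypothetical. Each step takes a tree and returns a new one, a little
closer to the target. In XSLT 2.0 the second and third steps would more
naturally be written with xsl:for-each-group (group-adjacent and
group-starting-with), with each intermediate result held in a temporary
tree.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical flat input: every paragraph is a sibling, and structure is
# only implied by a style name, loosely imitating word-processor output.
FLAT = """<body>
  <p style="Headword">apple</p>
  <p style="Sense">a round fruit</p>
  <p style="Sense">the tree that bears it</p>
  <p style="Headword">pear</p>
  <p style="Sense">a sweet fruit</p>
</body>"""


def rename_by_style(body):
    """Step 1: turn style names into element names (result is still flat)."""
    out = ET.Element("body")
    for p in body:
        el = ET.SubElement(out, p.get("style").lower())
        el.text = p.text
    return out


def group_senses(body):
    """Step 2: wrap each run of adjacent <sense> elements in <senses>."""
    out = ET.Element("body")
    for tag, run in itertools.groupby(body, key=lambda el: el.tag):
        if tag == "sense":
            ET.SubElement(out, "senses").extend(list(run))
        else:
            out.extend(list(run))
    return out


def build_entries(body):
    """Step 3: start a new <entry> at each <headword>."""
    out = ET.Element("dictionary")
    entry = None
    for el in body:
        if el.tag == "headword" or entry is None:
            entry = ET.SubElement(out, "entry")
        entry.append(el)
    return out


def run_pipeline(xml_text):
    """Apply the steps in order, feeding each result to the next step."""
    tree = ET.fromstring(xml_text)
    for step in (rename_by_style, group_senses, build_entries):
        tree = step(tree)
    return tree


result = run_pipeline(FLAT)
```

The point of the design is that each step is small enough to inspect and
test on its own, and that a step can be replaced or a new one slotted in
between without disturbing the others.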
Yet the heart of XSLT 1.0, namely processing by templates, remains at
least as important as all these features if not more so, as it's the
basis of everything else. Accordingly, some experience in plain
old-fashioned "down-conversion", of strong XML (not the semantically
weak stuff you are starting with, but more like what you want to produce
out of it) into display formats like HTML, is more or less essential
practice for most beginners.
It is also a useful exercise for another reason: you will get a better
feel for what makes a strong vs. a weak XML format. (If you have a good
head for this kind of work you'll already have an intuitive
understanding of this, but there's nothing like experience.) It's
critical to be able to think through these issues if you're also faced
with the problem of designing the target format for your upconversion.
Even if you use a dictionary encoding standard like TEI more or less off
the shelf, which will help, you'll face issues here.
One way of saying this is that you might find the problems more
tractable in development if you start with the later stages of your
pipeline and then push it backwards (and then forwards again), working
up sample data by hand as you go and taking note of the particulars of
the conversions necessary. This will engage you in an iterative design
process that will expose the complexities of the problems as you work,
rather than face you with all of them in a big tangle from the very start.
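As an illustration of starting from the far end, here is a minimal
sketch, again in Python with the standard library only, of a last stage
written against a hand-made sample of the target format. Everything in
it is an assumption made for the example: the element names
(dictionary, entry, headword, senses, sense), the sample content, and
the choice of an HTML definition list as the display format. In practice
this stage would be an ordinary template-driven XSLT stylesheet.

```python
import xml.etree.ElementTree as ET

# Hand-written sample of the *target* format (hypothetical element names),
# standing in for what the earlier pipeline stages will eventually produce.
SAMPLE = """<dictionary>
  <entry>
    <headword>apple</headword>
    <senses>
      <sense>a round fruit</sense>
      <sense>the tree that bears it</sense>
    </senses>
  </entry>
</dictionary>"""


def entry_to_html(entry):
    """Render one <entry> as a <dt>/<dd> pair for a definition list."""
    dt = ET.Element("dt")
    dt.text = entry.findtext("headword")
    dd = ET.Element("dd")
    ol = ET.SubElement(dd, "ol")
    for sense in entry.iter("sense"):
        ET.SubElement(ol, "li").text = sense.text
    return [dt, dd]


def to_html(dictionary):
    """Down-convert the whole sample into a single <dl> element."""
    dl = ET.Element("dl")
    for entry in dictionary.findall("entry"):
        dl.extend(entry_to_html(entry))
    return dl


html = ET.tostring(to_html(ET.fromstring(SAMPLE)), encoding="unicode")
```

Once this stage works on the hand-made sample, the sample itself becomes
the specification for what the stage before it has to produce.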
Cheers,
Wendell
======================================================================
Wendell Piez
mailto:wapiez(_at_)mulberrytech(_dot_)com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================