Re: [xsl] Advice on dictionary conversion

2011-01-18 11:35:36
Ciarán,

On 1/17/2011 5:57 PM, Michael Kay wrote:
> This is a tough problem for three reasons:
>
> (a) upconversion is intrinsically hard, because it depends on
> recognizing the patterns that occur in the source - it's very much a
> heuristic rather than algorithmic process
>
> (b) the XML that you get out of MS-Word is not the easiest thing to
> start from, to put it mildly
>
> (c) for the above two reasons, you'll need to use every trick in the
> XSLT book, but you lack XSLT experience.
>
> The best advice I can give for this kind of task is to build it as a
> pipeline of transformations each of which gets you one step closer to
> the target....

I'd hope to have said exactly the same thing as Mike says here, though not so concisely. As it is, I can start by underlining it. Especially point (c). (And Jeroen's and Dan's posts are also good advice.)

A little background may also be helpful in developing your strategy:

For practical purposes this task is more or less impossible in unassisted XSLT 1.0, but it becomes feasible with features in XSLT 2.0, including grouping (xsl:for-each-group), regular-expression string processing, stylesheet functions, and temporary trees (to name only the most important). None of these are part of XSLT 1.0, and none of them are beginner-level topics.
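
To make three of those features concrete, here is a minimal XSLT 2.0 sketch of grouping, a stylesheet function, and regular-expression processing working together. Everything about the input is an assumption made for illustration: a flat <doc> of <p> elements, a style="Headword" attribute marking the start of each entry, and sense paragraphs that may begin with a part-of-speech label such as "n." or "adj.". Real Word-derived XML will look different and will need a different pattern.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:functions"
    exclude-result-prefixes="xs f">

  <xsl:output indent="yes"/>

  <!-- Stylesheet function (hypothetical name): does this paragraph
       carry the assumed "Headword" style? -->
  <xsl:function name="f:is-headword" as="xs:boolean">
    <xsl:param name="p" as="element(p)"/>
    <xsl:sequence select="$p/@style = 'Headword'"/>
  </xsl:function>

  <!-- Grouping: every headword paragraph opens a new entry. Any
       paragraphs before the first headword form a group of their own. -->
  <xsl:template match="doc">
    <dictionary>
      <xsl:for-each-group select="p"
                          group-starting-with="p[f:is-headword(.)]">
        <entry>
          <xsl:apply-templates select="current-group()"/>
        </entry>
      </xsl:for-each-group>
    </dictionary>
  </xsl:template>

  <xsl:template match="p[f:is-headword(.)]">
    <headword><xsl:value-of select="normalize-space(.)"/></headword>
  </xsl:template>

  <!-- Regular expressions: split a leading part-of-speech label from
       the rest of the paragraph; leave other paragraphs as plain
       definitions. -->
  <xsl:template match="p">
    <sense>
      <xsl:analyze-string select="normalize-space(.)"
                          regex="^(n|v|adj|adv)\.\s+(.*)$">
        <xsl:matching-substring>
          <pos><xsl:value-of select="regex-group(1)"/></pos>
          <def><xsl:value-of select="regex-group(2)"/></def>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <def><xsl:value-of select="."/></def>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </sense>
  </xsl:template>

</xsl:stylesheet>
```

The heuristic part is the regular expression and the choice of what counts as a headword; the grouping machinery stays the same once those are settled.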

Yet the heart of XSLT 1.0, namely processing by templates, remains at least as important as all these features, if not more so, since it is the basis of everything else. Accordingly, some experience in plain old-fashioned "down-conversion" of strong XML (not the semantically weak stuff you are starting with, but more like what you want to produce from it) into display formats like HTML is more or less essential practice for most beginners.
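
As an illustration of that kind of practice exercise, here is a small XSLT 1.0 down-conversion sketch. The source vocabulary (dictionary, entry, headword, sense, pos, def) is invented for the example and stands in for whatever "strong" target format you settle on; it is not TEI or any other standard.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html" indent="yes"/>

  <!-- Ignore whitespace-only text between elements in the source. -->
  <xsl:strip-space elements="*"/>

  <xsl:template match="/dictionary">
    <html>
      <head><title>Dictionary</title></head>
      <body>
        <xsl:apply-templates select="entry"/>
      </body>
    </html>
  </xsl:template>

  <!-- One template per element type: each says how its own element
       is rendered and hands its children on to the other templates. -->
  <xsl:template match="entry">
    <div class="entry">
      <xsl:apply-templates/>
    </div>
  </xsl:template>

  <xsl:template match="headword">
    <b><xsl:apply-templates/></b>
  </xsl:template>

  <!-- Number the senses within each entry. -->
  <xsl:template match="sense">
    <p>
      <xsl:number count="sense" format="1. "/>
      <xsl:apply-templates/>
    </p>
  </xsl:template>

  <xsl:template match="pos">
    <i><xsl:apply-templates/></i>
    <xsl:text> </xsl:text>
  </xsl:template>

  <xsl:template match="def">
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```

Note how little logic is needed when the source markup already says what each piece of text is; that contrast with the upconversion is the point of the exercise.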

It is also a useful exercise for another reason: you will get a better feel for what makes a strong vs. a weak XML format. (If you have a good head for this kind of work you'll already have an intuitive understanding of this, but there's nothing like experience.) It's critical to be able to think through these issues if you're also faced with the problem of designing the target format for your upconversion. Even if you use a dictionary encoding standard like TEI more or less off the shelf, which will help, you'll face issues here.

Put another way, you might find the problems more tractable if you start development with the later stages of your pipeline and then push backwards (and then forwards again), working up sample data by hand as you go and taking note of the particulars of the conversions each stage needs. This will engage you in an iterative design process that exposes the complexities of the problem as you work, rather than facing you with all of them in one big tangle from the very start.
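
A rough sketch of what such a pipeline can look like inside a single XSLT 2.0 stylesheet, using modes and a temporary tree. The two stages, the mode names "clean" and "structure", and the input vocabulary (a <doc> of <p> elements with a style attribute) are all assumptions for illustration. Because each stage only sees the output of the one before it, a later stage can be written and tested first against hand-made input.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output indent="yes"/>

  <xsl:template match="/">
    <!-- Stage 1: tidy the raw input and hold the result in a
         temporary tree. -->
    <xsl:variable name="cleaned">
      <xsl:apply-templates select="node()" mode="clean"/>
    </xsl:variable>
    <!-- Stage 2: add structure to the output of stage 1. -->
    <xsl:apply-templates select="$cleaned/node()" mode="structure"/>
  </xsl:template>

  <!-- Identity template shared by both stages: anything a stage does
       not handle explicitly is copied through unchanged. -->
  <xsl:template match="@*|node()" mode="clean structure">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- Stage 1 rule: drop paragraphs with no visible text. -->
  <xsl:template match="p[not(normalize-space())]" mode="clean"/>

  <!-- Stage 2 rule: wrap each run of paragraphs that starts with a
       headword paragraph in an entry element. -->
  <xsl:template match="doc" mode="structure">
    <dictionary>
      <xsl:for-each-group select="p"
                          group-starting-with="p[@style = 'Headword']">
        <entry>
          <xsl:copy-of select="current-group()"/>
        </entry>
      </xsl:for-each-group>
    </dictionary>
  </xsl:template>

</xsl:stylesheet>
```

The same stages could equally be separate stylesheets run one after another; keeping them in one file with modes is just one way to arrange it.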

Cheers,
Wendell


======================================================================
Wendell Piez                 mailto:wapiez(_at_)mulberrytech(_dot_)com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
