Re: [xsl] Top 10 XSLT patterns
2014-04-03 19:48:30
On 4/3/14 11:33 AM, Abel Braaksma (Exselt) wrote:
> It will likely be non-trivial to compile such a list without a good
> query to search through existing stylesheets and known programming
> challenges. But from your experience, what patterns do you encounter
> most often?
Here are some concrete examples that have come up repeatedly for us
when processing large texts. I don't see how they map onto the patterns
you are all discussing, but they are probably combinations of them in
some way?
Something we've had to implement multiple times in various combinations
(XSLT 1, 2, XQuery, JDOM/Java) is what I call the "proem" extractor:
pull out the first N characters (or words) of a document, maintaining
all of the ancestral markup. A more elaborate variant is to extract an
intermediate section that can be defined in various ways (characters N
to N+100, everything between two <mark> elements, etc.). I don't know
what to call that -- tree surgery? Typically the goal is to generate
document summaries, hit highlighting, or annotated passages.
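A minimal sketch of that "proem" extractor, in Python with the standard
library (rather than XSLT, to keep it self-contained): copy a document's
leading content up to a character budget while keeping the ancestral
element structure intact. The element names are invented for
illustration.

```python
import xml.etree.ElementTree as ET

def proem(elem, budget):
    """Return (copy, remaining_budget); copy is None once the budget is spent."""
    if budget <= 0:
        return None, 0
    out = ET.Element(elem.tag, elem.attrib)
    text = elem.text or ""
    out.text, budget = text[:budget], budget - len(text)
    for child in elem:
        if budget <= 0:
            break                       # budget spent; drop remaining siblings
        sub, budget = proem(child, budget)
        if sub is not None:
            out.append(sub)             # keep the child (with its ancestors intact)
            tail = child.tail or ""
            sub.tail, budget = tail[:max(budget, 0)], budget - len(tail)
    return out, budget

doc = ET.fromstring("<book><p>Hello <b>brave</b> new</p><p>world</p></book>")
head, _ = proem(doc, 12)
print(ET.tostring(head, encoding="unicode"))
# -> <book><p>Hello <b>brave</b> </p></book>
```

A word-based budget would be the same walk with a different counting
function; an XSLT version would thread the remaining budget through a
tunnel parameter instead of a return value.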
Another major problem for us has been reference resolution: a set of
documents is marked up with cross-references to other documents (or
sub-documents), and the problem is to copy some part of each referenced
document into the reference itself (as a performance optimization, so it
doesn't have to be looked up later). The basic idea is simple enough,
but it is complicated by very large numbers of large documents with
large numbers of references. Another complication is that the document
corpus may be constantly evolving; as new documents are introduced, both
outbound *and inbound* references must be resolved.
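The basic resolution step might look like this sketch: for every
cross-reference, copy a fragment (here, the target's <title>) into the
referring element so it need not be looked up at render time. The
<xref ref="..."> markup and the in-memory corpus are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical corpus: document id -> parsed document.
corpus = {
    "doc1": ET.fromstring('<doc id="doc1"><title>First</title>'
                          '<p>See <xref ref="doc2"/>.</p></doc>'),
    "doc2": ET.fromstring('<doc id="doc2"><title>Second</title></doc>'),
}

def resolve(doc):
    for xref in doc.iter("xref"):
        target = corpus.get(xref.get("ref"))
        if target is not None:
            # Inline a copy of the target's title into the reference.
            xref.text = target.find("title").text

resolve(corpus["doc1"])
print(ET.tostring(corpus["doc1"].find("p"), encoding="unicode"))
# -> <p>See <xref ref="doc2">Second</xref>.</p>
```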
There are lots of variants of this reference resolution problem: simple
links, abbreviation expansion, footnote inlining. Footnotes are
especially challenging since they may contain further references to
additional footnotes, so the expansion is recursive (and inevitably,
circular). References might be to non-XML documents and trigger non-XML
processing: specifically for image files, we would typically want to
store a reference to the image file indicating whether it exists (and
where, if we had to hunt for it), its size, format, etc.
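Since footnote expansion is recursive and can be circular, the expander
needs explicit cycle detection; a sketch (the <fn>/<note ref="...">
markup is invented, and the expansion is flattened to text for brevity):

```python
import xml.etree.ElementTree as ET

# Two hypothetical footnotes that refer to each other.
footnotes = {
    "a": ET.fromstring('<fn id="a">alpha, see <note ref="b"/></fn>'),
    "b": ET.fromstring('<fn id="b">beta, see <note ref="a"/></fn>'),
}

def expand(fn_id, seen=()):
    if fn_id in seen:            # break the cycle rather than recurse forever
        return "[circular: %s]" % fn_id
    fn = footnotes[fn_id]
    parts = [fn.text or ""]
    for note in fn:              # each child is a nested <note> reference
        parts.append(expand(note.get("ref"), seen + (fn_id,)))
        parts.append(note.tail or "")
    return "".join(parts)

print(expand("a"))
# -> alpha, see beta, see [circular: a]
```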
A very common feature of all of our pipelines is chunking. The
canonical example is pulling all the chapters out of a book document and
creating standalone chapter documents plus a skeletal book document that
serves as a cover page and table of contents. We usually want to
preserve some ancestral markup in the "chapters", and since we are
generating new documents, we need to keep track of references to them
for the TOC, for next/previous navigation links, and for
translating/resolving other cross-references that were intra-document
but have become inter-document.
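A sketch of that chunking step: pull each chapter out into its own
document, and leave a skeletal book behind whose TOC points at the new
chunks. The element names and the "chapNN.xml" naming scheme are
invented for illustration; in XSLT 2.0 the chunks would come out of
xsl:result-document instead.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    '<book><title>T</title>'
    '<chapter><title>One</title><p>...</p></chapter>'
    '<chapter><title>Two</title><p>...</p></chapter></book>')

chunks = {}                                 # filename -> standalone chapter doc
skeleton = ET.Element("book")
ET.SubElement(skeleton, "title").text = book.findtext("title")
toc = ET.SubElement(skeleton, "toc")

for i, chap in enumerate(book.findall("chapter"), 1):
    href = "chap%02d.xml" % i
    chunks[href] = chap                     # would be serialized to its own file
    entry = ET.SubElement(toc, "entry", href=href)
    entry.text = chap.findtext("title")     # remember the title for the TOC

print(ET.tostring(skeleton, encoding="unicode"))
```

The `chunks` mapping is what later passes consult to rewrite formerly
intra-document cross-references into inter-document ones.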
Another dumb thing we do all the time is run a list of XPaths over a
document and save the results into a Java object for easy access in our
application framework. This is just a simplified version of marshalling
(or unmarshalling?) to cross the language barrier (we call it "XML
mapping"). We also use XSLT to render these XML documents as HTML, but
when we need (usually atomic) values to be handled by our Java
application layer, we want an easy way to extract them from the XML.
For large numbers of paths, I think we would be better off doing this
with a single generated XSLT (so we don't traverse the document once
per path), but currently we don't.
I hope that's useful.
-Mike
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--