Comments attached...
EXECUTIVE SUMMARY
I start with my conclusions:
1) regex-enabled templates are possible in XSLT 1.0 today (using
Java extension functions, as supported by Saxon or Xalan).
2) the features we need in future XSLT which would make this
more useful are:
a) variables in xsl:template/@match patterns (which are
currently not allowed)
They are allowed in XSLT 2.0.
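For example, this kind of pattern is legal in 2.0 (a minimal sketch; $sep and the element name are illustrative):

```xml
<!-- XSLT 2.0: a variable referenced inside a match pattern,
     which XSLT 1.0 forbids. -->
<xsl:variable name="sep" select="','"/>

<xsl:template match="item[contains(., $sep)]">
  <!-- handle items whose text contains the separator -->
</xsl:template>
```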
b) a mechanism to fail a template and try the next
eligible template. (This turns out to be the most
critical feature for making XSLT work for a reasonably
powerful "up-translator".)
This is a "could" in the XSLT 2.0 requirements list and we've just
started reviewing whether to do anything about this, so any use cases
will be welcome - please send them to public-qt-comments(_at_)w3(_dot_)org
c) extend the XSLT processing model with some tail recursion
elimination or add a built-in feature for tokenizing
text nodes. (This may already be provided in Saxon;
it may be just an implementation issue.)
My feeling is that tail-recursion is an implementation issue, though I
know that some FP languages essentially mandate that implementations
support it.
Saxon (incidentally) never does tail recursion of an apply-templates
call, it only does it for call-template. No good reason - I just never
thought of doing it.
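The usual tail-recursive tokenizing idiom with xsl:call-template looks roughly like this (a sketch; the template and parameter names are illustrative):

```xml
<!-- XSLT 1.0: tail-recursive tokenizer over a comma-separated string.
     The recursive xsl:call-template is the last instruction, so an
     implementation can replace the recursion with a loop. -->
<xsl:template name="tokenize">
  <xsl:param name="text"/>
  <xsl:if test="string-length($text) &gt; 0">
    <token>
      <xsl:value-of select="substring-before(concat($text, ','), ',')"/>
    </token>
    <xsl:call-template name="tokenize">
      <xsl:with-param name="text" select="substring-after($text, ',')"/>
    </xsl:call-template>
  </xsl:if>
</xsl:template>
```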
3) the new xsl:analyze-string instruction and the XPath regex
support that has been developed in parallel may not be a
sufficient substitute for the method I am describing here.
It would be interesting to see use cases that demonstrate what the
limitations are.
OVERVIEW OF THE APPROACH
<snip/>
The typical processing model for parsing a text node, after
xsl:apply-templates is invoked with a text node selected, is to
match the head of the text node against a regular expression,
consume the matching head, and generate a new text node
containing the unmatched tail. The tail is then selected in a
recursive xsl:apply-templates call.
Interesting approach. Generally, creating nodes is expensive. It also
requires a lot of specification work to sort out the detail, e.g. what
is the parent of the node, what is its base URI, do you get a new text
node each time or can the system reuse them? I think a mechanism based
on strings (like xsl:analyze-string) is more flexible than one based on
text nodes.
HOW DOES XSLT/XPath 2.0 REGEX SUPPORT HELP HERE?
On the surface, the new XPath regex support would make the
ORO-Matcher and my regex wrapper object obsolete. However, the
functions that my wrapper served were:
- keep a symbol table of regexes to avoid recompiling them
I think that's something that an implementation can easily do behind the
scenes.
- keep a regex with an internal state (caching the last match)
to avoid frequent re-matching of the same text or pieces
of it
I'm not at all sure that this fits well into the functional programming
model.
- allow these regex objects to appear in xsl:template/@match
patterns
Particularly if you add the new xsl:analyze-string form into
the mix, the need for these kinds of things may be entirely gone.
But I keep coming back to the analogy of xsl:template
matching to regex pattern matching. Having the matching rules
handled by real XSLT templates with regex in the @match
pattern is quite intuitive and much more generally useful
than the simple tokenization that happens in the
analyze-string form. The analyze-string form can only test a
single regex, but in text parsing you need to try many
patterns against the current head of the unparsed text.
Can't this be handled fairly intuitively by using the fn:matches()
function in conjunction with xsl:analyze-string?
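For instance, several token patterns can be tried at once by combining them as regex alternatives and testing regex-group() on each match (a sketch; the patterns and output element names are illustrative):

```xml
<!-- XSLT 2.0: one xsl:analyze-string handling two token classes. -->
<xsl:template match="text()">
  <xsl:analyze-string select="." regex="([0-9]+)|([A-Za-z]+)">
    <xsl:matching-substring>
      <xsl:choose>
        <xsl:when test="regex-group(1) != ''">
          <num><xsl:value-of select="."/></num>
        </xsl:when>
        <xsl:otherwise>
          <word><xsl:value-of select="."/></word>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:matching-substring>
    <xsl:non-matching-substring/>
  </xsl:analyze-string>
</xsl:template>
```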
What I think would be really useful is if you wrote up your example use
case using the XSLT 2.0 / XPath 2.0 facilities, so that we could see
where the difficulties really are. At present, your note reads as if you
have decided on one design approach, and you are not really prepared to
consider reworking it to use the XSLT 2.0 constructs as they were
designed to be used.
I would also add that general-purpose parsing (like writing a COBOL
compiler in XSLT) was not really the application we had in mind. The
real test is whether the facilities are adequate to analyze the
structure found in the text of typical data files. I've used them for
"screen-scraping" data downloaded in HTML and found them quite workable,
though it needed several passes.
Michael Kay
Software AG
home: Michael(_dot_)H(_dot_)Kay(_at_)ntlworld(_dot_)com
work: Michael(_dot_)Kay(_at_)softwareag(_dot_)com
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list