Comments attached...
EXECUTIVE SUMMARY
I start with my conclusions:
1) regex-enabled templates are possible in XSLT 1.0 today (using
Java extension functions, as supported by Saxon or Xalan).
2) the features we need in future XSLT which would make this
more useful are:
a) variables in xsl:template/@match patterns (which are
currently not allowed)
They are allowed in XSLT 2.0.
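For example, this kind of pattern is legal in 2.0 (a minimal sketch; $sep and the element name are illustrative):

```xml
<!-- XSLT 2.0: a variable referenced inside a match pattern,
     which XSLT 1.0 forbids. -->
<xsl:variable name="sep" select="','"/>

<xsl:template match="item[contains(., $sep)]">
  <!-- handle items whose text contains the separator -->
</xsl:template>
```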
b) a mechanism to fail a template and try the next
eligible template. (This turns out to be the most
critical feature for making XSLT work for a reasonably
powerful "up-translator".)
This is a "could" in the XSLT 2.0 requirements list and we've just
started reviewing whether to do anything about this, so any use cases
will be welcome - please send them to public-qt-comments(_at_)w3(_dot_)org
c) extend the XSLT processing model with some tail recursion
elimination or add a built-in feature for tokenizing
text nodes. (This may already be provided in Saxon;
it may be just an implementation issue.)
My feeling is that tail-recursion is an implementation issue, though I
know that some FP languages essentially mandate that implementations
support it.
Saxon (incidentally) never does tail recursion of an apply-templates
call, it only does it for call-template. No good reason - I just never
thought of doing it.
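The usual tail-recursive tokenizing idiom with xsl:call-template looks roughly like this (a sketch; the template and parameter names are illustrative):

```xml
<!-- XSLT 1.0: tail-recursive tokenizer over a comma-separated string.
     The recursive xsl:call-template is the last instruction, so an
     implementation can replace the recursion with a loop. -->
<xsl:template name="tokenize">
  <xsl:param name="text"/>
  <xsl:if test="string-length($text) &gt; 0">
    <token>
      <xsl:value-of select="substring-before(concat($text, ','), ',')"/>
    </token>
    <xsl:call-template name="tokenize">
      <xsl:with-param name="text" select="substring-after($text, ',')"/>
    </xsl:call-template>
  </xsl:if>
</xsl:template>
```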
3) the new xsl:analyze-string instruction and the XPath regex
support that has been developed in parallel may not be a
sufficient substitute for the method I am describing here.
It would be interesting to see use cases that demonstrate what the
limitations are.
OVERVIEW OF THE APPROACH
<snip/>
The typical processing model for parsing a text node, after
xsl:apply-templates is invoked with a text node selected, is to
match the head of the text node against a regular expression,
consume the matching head, and generate a new text node
containing the unmatched tail. The tail is then selected in a
recursive xsl:apply-templates call.
Interesting approach. Generally, creating nodes is expensive. It also
requires a lot of specification work to sort out the detail, e.g. what
is the parent of the node, what is its base URI, do you get a new text
node each time or can the system reuse them? I think a mechanism based
on strings (like xsl:analyze-string) is more flexible than one based on
text nodes.
HOW DOES XSLT/XPath 2.0 REGEX SUPPORT HELP HERE?
On the surface, the new XPath regex support would make the
ORO-Matcher and my regex wrapper object obsolete. However, the
functions that my wrapper served were:
- keep a symbol table of regexes to avoid recompiling them
I think that's something that an implementation can easily do behind the
scenes.
- keep a regex with an internal state (caching the last match)
to avoid frequent re-matching of the same text or pieces
of it
I'm not at all sure that this fits well into the functional programming
model.
- allow these regex objects to appear in xsl:template/@match
patterns
Particularly if you add the new xsl:analyze-string form into
the mix, the need for these kinds of things may be entirely gone.
But I keep coming back to the analogy of xsl:template
matching to regex pattern matching. Having the matching rules
handled by real XSLT templates with regex in the @match
pattern is quite intuitive and much more generally useful
than the simple tokenization that happens in the
analyze-string form. The analyze-string form can only test a
single regex, but in text parsing you need to try many
patterns against the current head of the unparsed text.
Can't this be handled fairly intuitively by using the fn:matches()
function in conjunction with xsl:analyze-string?
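For instance, several token patterns can be tried at once by combining them as regex alternatives and testing regex-group() on each match (a sketch; the patterns and output element names are illustrative):

```xml
<!-- XSLT 2.0: one xsl:analyze-string handling two token classes. -->
<xsl:template match="text()">
  <xsl:analyze-string select="." regex="([0-9]+)|([A-Za-z]+)">
    <xsl:matching-substring>
      <xsl:choose>
        <xsl:when test="regex-group(1) != ''">
          <num><xsl:value-of select="."/></num>
        </xsl:when>
        <xsl:otherwise>
          <word><xsl:value-of select="."/></word>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:matching-substring>
    <xsl:non-matching-substring/>
  </xsl:analyze-string>
</xsl:template>
```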
What I think would be really useful is if you wrote up your example use
case using the XSLT 2.0 / XPath 2.0 facilities, so that we could see
where the difficulties really are. At present, your note reads as if you
have decided on one design approach, and you are not really prepared to
consider reworking it to use the XSLT 2.0 constructs as they were
designed to be used.
I would also add that general-purpose parsing (like writing a COBOL
compiler in XSLT) was not really the application we had in mind. The
real test is whether the facilities are adequate to analyze the
structure found in the text of typical data files. I've used them for
"screen-scraping" data downloaded in HTML and found them quite workable,
though it needed several passes.
Michael Kay
Software AG
home: Michael(_dot_)H(_dot_)Kay(_at_)ntlworld(_dot_)com
work: Michael(_dot_)Kay(_at_)softwareag(_dot_)com
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list