Regex-Enabled XSLT is Possible -- Preliminary Results and Desiderata for

Hi,

It has been almost a year now since we discussed about this topic

under the Subject: "XSLT match with regex what's the best currentsolution?" And I left with a vague but excited announcement that I

had a solution and was working on perfecting it.

Then I was actually trying to publish this somewhere, and I picked
Dr. Dobbs, and that was a mistake. I held the discussion away from
the list worried about prior publication issues, but the whole
DDJ thing was a complete waste of time. The editors seem to be
very disorganized, I never received a response, when I called the
editor on the phone he said he was quite interested and would
respond in a few days and that was the last thing I heard in months.
Enough of that!

Now I am converting my stuff to XSLT 2.0 / XPath 2.0 and found
that there had been some progress made on the regex front. So,
may be it is too late now, but please consider the following
description on what I did. It's a bit long, but it isn't an
easy thing. Nevertheless XML up-translation keeps being an
important issue that deserves the discussion. The good news is
that XSLT is so very cool that one can do these things very
nicely.


EXECUTIVE SUMMARY

I start with my conclusions:

1) regex-enabled templates are possible in XSLT 1.0 today (with
   the use of Java extensions, as possible in Saxon or Xalan.)

2) the features we need in future XSLT which would make this
   more useful are:

   a) variables in xsl:template/@match patterns (which is
      currently not allowed.)

   b) a meachanism to fail a template and try the next
      eligible template. (This turns out to be the most
      critical feature for making XSLT work for a reasonably
      powerful "up-translator".)

   c) extend the XSLT processing model with some tail recursion
      elimination or add a built-in feature for tokenizing
      text nodes. (May already be provided in Saxon, may be
      just an implementation issue.)

3) the new xsl:analyze-string funcitions and the XPath regex
   support that has been developed in parallel may not be a
   sufficient substitute for the method I am describing here.


OVERVIEW OF THE APPROACH

In XSLT 1.0, I had successfully devised a method to do some
powerful regular expression based parsing of free text in XSLT
using an XSLT processor that allows Java extension function to
be called (e.g., Saxon). I created a very simple wrapper
around the Apache/jakarta ORO regular expression package,
imported this wrapper and directly some ORO classes into XSLT.
Given this modest prep-work I can use regexes in xsl:template/
@match patterns. Once XSLT selects a template, in the body
of the template I can refer to parenthesis-subexpressions
(aka "groups") without having to call the regex matcher
twice.

The typical processing model for parsing a text node after
xsl:apply-templates with a text node selected is to match
the head of the text node to a regular expression, consume
the matching head and generate a new text node that is the
unmatched tail. The tail is then selected in a recursive
xsl:apply-template statement.


EXAMPLES

Here I show actual code examples. Notice that all code presented
here is Copyright (c) 2002 Regenstrief Institute. All rights reserved.
And I assert the GNU Public License for this code.

PREPWORK

I start my XSL transforms with the following header, notice the
xmlns:regex and xmlns:match elements.

<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform";
  xmlns:regex="java:regex"
  xmlns:match="java:org.apache.oro.text.regex.MatchResult"
  xmlns:saxon="http://icl.com/saxon";
  xmlns:exsl="http://exslt.org/common";
  extension-element-prefixes="saxon exsl"
  exclude-result-prefixes="regex match saxon">

Appart from the ease of integrating Java extensions in saxon,
I only need the saxon extension element node-set because the
exsl:node-set does not allow strings to be turned into text
nodes whereas saxon:node-set does.

[This node-set issue was necessary in XSLT 1.0 but is now
*largely* irrelevant in XSLT 1.1 and XSLT 2.0. However, there
is still sometimes the need to convert a string into a text
node, which I don't know right now how to do.]

The match extension directly uses the jakarta ORO class
MatchResult. The regex extension links to a simple wrapper
whose purpose is to:

- Hide the prep-work for setting up the ORO matcher from XSLT
  to express a simple API to the XSLT.

- Provide a hash map of regex pattern names and objects that
  contain a regex pattern string, and ORO Pattern object,
  and the last MatchResult.

Since XSLT doesn't allow variables in xsl:template/@match
patterns, I need to refer to templates symbolically. My
template-wrapper has a symbol-table mapping regex-name to
a compiled regex wrapper object. These regex objects are
stateful and remember their last match.


XSLT TEMPLATES THAT MATCH ON REGEX

Any of my regex enables templates begins like this example:

<xsl:template mode="..."
    match="text()[regex:match-by-name('my-pattern',.)]">

the function regex:match-by-name(pattern-name, string)
refers to a regex object that has previously been set up
as follows:

<xsl:variable name="my-pattern"
   select="regex:new('my-pattern', '^...')"/>

where '^...' is the regex pattern string accoring to ORO's
Perl5 regex (the most powerful ones.) If the object is to
parse the whole text node, I usually want the pattern anchored
at the start of the string.

Notice that I use the same name for the XSLT variable of that
regex object and for the name of the regex (managed in my little
wrapper's regex symbol table), just to keep things understandible.

I have to use that stored name in the regex:match-by-name
function because XSLT template match patterns cannot have
variables in them. Otherwise I might have preferred to start
the template like this:

<xsl:template mode="..."
    match="text()[regex:match($my-pattern,.)]">

I still want to create these regex objects that store the
regex pattern and the last match, in order to avoid double
execution of the matcher when we get into the template
body.


USING REGEX IN TEMPLATE BODIES

Once I have a match, in the template-body I usually want to
extract sub-strings matched by the regex patterns. To reduce
clutter, the first thing I do is put an XSLT variable handle
on the match:

  <xsl:variable name="this-match"
      select="regex:get-match($my-pattern)"/>

The regex:get-match function returns an ORO MatchResult object
that we prefix with 'match:' here. So, I can now use the whole
matching text as an XSLT text node by:

  <xsl:variable name="this-text"
      select="saxon:node-set(match:group($this-match, 0))"/>

Or I can extract the parenthetic groups as:

  <xsl:variable name="group-1"
      select="saxon:node-set(match:group($this-match, 1))"/>
  <xsl:variable name="group-2"
      select="saxon:node-set(match:group($this-match, 2))"/>
  ...
  <xsl:variable name="group-n"
      select="saxon:node-set(match:group($this-match, $n))"/>

Of course I give names to the group variables that are more
descriptive about my interpretation of this text. Those
variables can then be conveniently used to construct the
XSL structure describing the parsed thing.


PARSING THE WHOLE TEXT NODE

After one match is found, we typically want to continue parsing
the rest. It would be kind-of neat if XSLT has some primitive
that would allow me to send the unmatched rest of the text node
back to the main apply-template loop. However, that would simply
be an optimization of tail recursion, and tail-recursion I can do
today in XSLT (and Saxon does that at acceptable speed and
reliability even for longer input text.)

I can do tail recursion by including the following near the end
of the xsl:template body:

    <xsl:apply-templates mode="..."
       select="saxon:node-set(substring(.,
                   match:end-offset($this-match,0)+1)"/>

as you can see, I extract the trailing substring that was not
matched by the regex from the text-node, convert this trailing
substring to an text-node again, and then I apply the same regex-
enabled templates on that unparsed rest.

If all I had was some record-per-line flat data file, like
a "csv" dump from your favorite database or spreadsheet app,
this is all I would need. I could even eliminate tail recursion
by using some input filter that splits a text node per line.



Many up-translation cases however are more complex and require
some more sophisticated parser. One accepted model of a parser is
that of a "Push-down Automaton", i.e., an automaton described by
a state-transition diagram but with the ability to push part
of he parse on the stack and continue parsing of substructures.

Another model of a parser is a top-down recursive descent parser,
where you try matching your start symbol and in order to get that
the parser will parse it's pieces, and the pieces of the pieces
and so forth.

I guess am using some hybrid form between those two and I
heavily rely on XSLT modes to describe the state of my parser
at every given point in the execution. This works well for
hierarchical data, such as can be found in traditional EDI
encodings or some texts with itemized lists. There I have one
mode per level in the hierarchy. For example, I may have a
batch of reports where each report has a subject and a
list of events and each events has a list of details. Then I
have the modes: report-batch, report, and event (every
non-terminal symbol becomes a mode.)

REPORT
  SUBJECT
  EVENT
    DETAIL
    DETAIL
    ...
  EVENT
    DETAIL
    DETAIL
    ...
  ...
...

Now, each template will use the apply-template trick on the
unmatched text (the $rest) at least two times:

- when descending into the deep-structure of the text
  (I call that "parse-down")

- when doing the tail recursion as shown above (I call that
  "parse-along")

Every regex-based template will first do the parse-down on
its own unmatched text, and then does the parse-along on the
rest of the parse-down. In order to get to the rest we will
have to catch the output of the parse-down in a variable
as a result tree fragment (RTF) and then convert the RTF
to a node-set, make a copy-of everything but the last text
node and then do parse-along on that last text node.

Here is an example:

<xsl:template mode="report-batch"
   match="text()[regex:match-by-name($report-header,.)]">

  <xsl:variable name="this-text"
     select="saxon:node-set(match:group($this-match, 0))"/>

  <!-- parse down, switching mode -->

  <xsl:variable name="parse-down-RTF">
    <xsl:apply-template mode="report"
        select="saxon:node-set(substring(.,
                   match:end-offset($this-match,0)+1)"/>
  </xsl:variable>
  <xsl:variable name="parse-down"
     select="exsl:node-set(parse-down-RTF)"/>

  <xsl:variable name="report-body"
     select="$parse-down/node()[not(position()=last())]/>

  <!-- parse along, using same mode as ours -->
  <xsl:apply-templates mode="report-batch"
     select="$parse-down/text()[position()=last()]/>

</xsl:template>

I also define a default template for each mode that simply
returns the text node:

<xsl:template mode="report-batch" match="text()">
  <xsl:copy-of select="."/>
</xsl:template>

This will always put the yet unparsed rest at the end of
the current result tree fragment. When the down-parse
terminates because no more matching patterns can be found
at the beginning of the current text node, that remaining
unparsed text node at the end is still used for the parse-
along phase. If that fails it is handed up the parse-down
stack and so forth.

It is O.K. for a parse template to return a text node,
as long as there will be a final node at the end that
can be passed into the parse-along chain.


PROBLEMS

Now, as you see, one can do quite a bit of text parsing using
some of these patterns I described here. The nice thing about
XSLT is that it is only natural to add some higer level abstractions
to the XSLT language as extensions and then use an XSLT transform
to generate excutable XSLT from that more abstract definition
of the parser. I can go into more detail of this if you want,
but my abstraction is at this point too prototypical and the real
problems make it produce quite ugly XSLT code. So I want to talk
about the problems first.

A typical free text parser often requires adding some semantic
checking components in the rule that decide what symbol we
currently have. In other words, the decision of whether a
template matches is not just based on the mode and the regex
pattern, but also on other circumstances. For example, in a
semi-structured report, one may have to use the indention as
an indication for whether we have to descend into the deep
structure or if the next text stands for an element on the
same or a higher level of the logical structure of the input
text. For instance if we want to parse a list like this:

PROCARIONTS
  BACTERIAE
EUCARIONTS
  PROTOZOA
    ALVEOLATES
  METAZOA
    ANNELIDA
    NEMATODA
    ARTHROPODA
    CHORDATA
      MAMMALIA
        FELIDES
        CANINES
        APES


we use the indention level to decide whether to parse down or
to parse along or return to the next higher level. So, we would
like a rule like this

  <xsl:variable name="item-pattern"
       select="regex:new('item-pattern',
              "^\n?( *)([A-Z]*( [A-Z])*)(?=\n)"/>

  <xsl:template mode="item"
      match="text()[regex:match-by-name('item-pattern',.)]">

    <xsl:param name="parent-indent" select="-1"/>

    <xsl:variable name="this-match"
        select="regex:get-match($item-pattern)"/>
    <xsl:variable name="this-indent"
        select="saxon:node-set(match:group($this-match, 1))"/>
    <xsl:variable name="item-name"
        select="saxon:node-set(match:group($this-match, 2))"/>

    <!-- should process this item at the current level? -->
    <xsl:if test="string-length($this-indent)
             &lt;=string-length($parent-indent)">
      <!-- no, so try another template -->
      <xsl:next-match/>
      <!-- but there is no xsl:next-match instruction! -->
    </xsl:if>

    <!-- parse down in the next lower level -->
    <xsl:variable name="parse-down-RTF">
      <xsl:apply-template mode="item"
        select="saxon:node-set(substring(.,
                   match:end-offset($this-match,0)+1)"/>
        <xsl:with-param name="parent-indent" select="$this-indent"/>
      </xsl:apply-template>
    </xsl:variable>
    <xsl:variable name="parse-down"
       select="exsl:node-set(parse-down-RTF)"/>

    <!-- generate result element -->
    <ITEM NAME="$item-name">
       <xsl:copy-of
          select="$parse-down/node()[position()=1
                              or not(position()=last())]/>
    </ITEM>

    <!-- parse along in the current level -->
    <xsl:apply-templates mode="item"
       select="$parse-down/text()[not(position()=1)
                                  and position()=last()]>
       <xsl:with-param name="parent-indent" select="$parent-indent"/>
    </xsl:apply-templates>
  </xsl:template>

The "xsl:next-match" element would have aborted the
further processing of this template and try for another
matching template (e.g. a catch-all that returns nothing.)
This would cause the parent's parse-down apply-template
statement to return, and the parent would finish the
current item and then parse along. If that would fail
the control would return to the next higher parent,
etc.

The problem is while an element "xsl:next-match" is proposed
for XSLT 2.0 as part of the requirements document, it doesn't
currently exists and is not defined in the XSLT 2.0 working
draft. So, this method does not work.

As a work-around, we can try moving as many of the checks
into the template match pattern. This, however, becomes
very messy (as in this case we would have to combine all the
definitions of the $this-match and the $indent into a
complicated nested call of extension funcitons.) While
this could be done, especially with an XSLT-based compiler
compiler, it cannot possibly work because XSLT does not
allow any XSLT variables in xsl:template/@match patterns,
which we would need to compare to the passed parameter
$parent-indent. This is the critical point where the whole
approach fails with current XSLT. We will see a feature like
<xsl:next-match/> in the near future on XSLT 2.0 or 2.1 working
draft and draft implementations (particularly in Saxon) will
probably come up shortly. However, until then, we will
have to change our strategy just to make it work now.

Instead of using XSLT templates per each regex pattern, I
consolidate all regex-based templates into one template per
mode that starts as follows:

  <xsl:template mode="..." match="text()">

so, we work on any text node. The next step is to consolidate
all parameters that would be passed between the templates
into one place. (This can be a hairy issue because we need to
avoid name clashes.) The rest of the template is divided in
two phases, (1) the regex matching and template selection
phase and (2) an xsl:choose construct where each of the
consolidated template bodies is one alternative. For example,
we would rewrite the above list/item example as follows:

  <xsl:template mode="item" match="text()">
    <xsl:param name="parent-indent" select="-1"/>

    <xsl:variable name="eligible-branches-RTF">
       <xsl:if test="regex:match-by-name('item-pattern',.)">
          <xsl:variable name="this-match"
             select="regex:get-match($item-pattern)"/>
          <xsl:variable name="this-indent"
             select="saxon:node-set(match:group($this-match, 1))"/>

          <!-- check the conditions and, if positive, output an
               XML element of the name of this branch -->
          <xsl:if test="string-length($this-indent)
                    &gt;string-length($parent-indent)">
            <item/>
          </xsl:if>
       </xsl:if>
       <!-- include other alternative templates in this mode -->
    </xsl:variable>
    <xsl:variable name="eligible-branches"
         select="exsl:node-set($eligible-branches-RTF)"/>

    <xsl:choose>
      <xsl:when test="$eligible-branches/item">
        <!-- the following is what was the template body
             only without the parameter declarations -->

        <xsl:variable name="this-match"
           select="regex:get-match($item-pattern)"/>
        <xsl:variable name="this-indent"
           select="saxon:node-set(match:group($this-match, 1))"/>
        <xsl:variable name="item-name"
           select="saxon:node-set(match:group($this-match, 2))"/>

        <!-- parse down in the next lower level -->
        <xsl:variable name="parse-down-RTF">
          <xsl:apply-template mode="item"
            select="saxon:node-set(substring(.,
                       match:end-offset($this-match,0)+1)"/>
            <xsl:with-param name="parent-indent" select="$this-indent"/>
          </xsl:apply-template>
        </xsl:variable>
        <xsl:variable name="parse-down"
                   select="exsl:node-set(parse-down-RTF)"/>

        <!-- generate result element -->
        <ITEM NAME="$item-name">
           <xsl:copy-of
              select="$parse-down/node()[position()=1
                                  or not(position()=last())]/>
        </ITEM>

        <!-- parse along in the current level -->
        <xsl:apply-templates mode="item"
           select="$parse-down/text()[not(position()=1)
                                      and position()=last()]>
           <xsl:with-param name="parent-indent"
              select="$parent-indent"/>
        </xsl:apply-templates>
      </xsl:when>
      <xsl:otherwise>
        <!-- this is the default returning the unparsed text -->
        <xsl:copy-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

As a parser is more complex the xsl:choose can become quite an
unwieldy statement and putting all parameter declaration on
top of the one template is certainly not optimal. However, this
construct works. We now see a good use case for an XSLT parser
pre-processor that generates this executable XSLT code from
a more abstract and user-firendly description of the grammar.


RULE CONFLICTS AND PRIORITIES

The distinction between the eligibility-testing and then the
big xsl:choose statement can be used as an advantage because
it allows the parser to check if there are multiple
eligible branches. This can be used to issue a warning to
the user who might then do something about conflict resolution.

Conflict resolution in regular expressions turns out to be
an important problem that perhaps can best be solved by the
user with the feedback from the parser. Ideally a parser
transform would check all regex pattern analytically finding
out if there are strings that can be matched by more than
one pattern. And if so, could find out which of the two is
more specific and automatically assign higher priority to the
more specific pattern. Solving this question analytically,
however, is a hard problem in computer science, and hence we
will have to fall back to heuristics or ask the user for
help.

One possible heuristics is to assign priority to the more
complex regex patterns, i.e., the pattern that has a longer
string representation. Another heuristsics is to do some
parsing of the pattern and compare the size of the generated
finite directed graph (FDG). In addition we would want to
assert priority to the rules with more additional semantic
constraints (such as the once checking for the indent level.)

Whatever the source of the priority assignments is, it will]
ultimately reflect in the sequence of the xsl:when clauses in
the big xsl:choose statement.



HOW DOES XSLT/XPath 2.0 REGEX SUPPORT HELP HERE?

On the surface, the new XPatch regex support would obsolete
the ORO-Matcher and my regex wrapper object. However, the
two functions that my wrapper served were:

   - keep a symbol table of regexes to avoid recompiling them

   - keep a regex with an internal state (caching the last match)
     to avoid frequent re-matching of the same text or pieces
     of it

   - allow these regex objects to appear in xsl:template/@match
     patterns

Particularly if you add the new xsl:analyze-string form into
the mix, the need for these kinds of things may be entirely gone.

But, I keep coming back to the analogy of xsl:template matching
to regex pattern matching. Having the matching rules handled
by real XSLT templates with regex in the @match pattern is
quite intuitive and much more generally useful than the simple
tokenization that happens in the analyze-string form. The
analyze-string form can only test a single regex, but in text
parsing you need to try many patterns against the current
head of the unparsed text.

So, even though developed for XSLT v1.0 the approach to freetext
up-translating described here is still valid. Particularly the
new xsl:analyze-string instruction does not obsolte my approach;
because it can only find a sequence of tokens of one regex.
One could potentially determine an absolute order among all the
possible regexes and then nest those in the xsl:non-matching-substring
instruction. However, this gets very messy and it may not be easily
possible to clearly bring all the rules into a total order when
the number of matching rules grows. Templates with priorities.
are a much more clear and flexible alternative.

At this point, we have to deal with a unwieldy XSLT code either
way, however. With only two slight changes to XSLT, one could
solve this problem nicely in XSLT with nice code. The most important
of these changes is to allow failing a template and tell the
processor to try another matching template (a la xsl:next-match
proposal.)

regards
-Gunther



--
Gunther Schadow, M.D., Ph.D.                    
gschadow(_at_)regenstrief(_dot_)org
Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org



XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

Regex-Enabled XSLT is Possible -- Preliminary Results and Desiderata for future revisions of XSLT