I like the avoidance of the clumsy numbered capture groups (my non-starting
proposal would be to add Perl capture variables à la $1, $2, $3, $&, $', $`, etc.).
But how would you retrieve the value of the matching subgroup (the decimal
portion) in:
<xsl:matching-substring regex="\d+(\.\d*)?">
There's something asymmetric about your proposal that bothers me. There are
other cases of combining multiple capture groups that wouldn't get the same
special treatment (like the nesting in the example). Why assume that
capture groups are always combined as (...)|(...)? It's a very special
case: is it so common as to warrant special syntax?
-Mike
-----Original Message-----
From: Pavel Minaev [mailto:int19h(_at_)gmail(_dot_)com]
Sent: Thursday, August 20, 2009 5:59 PM
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] A better xsl:analyze-string
On Thu, Aug 20, 2009 at 2:39 PM, Michael Kay <mike(_at_)saxonica(_dot_)com> wrote:
It's true that using regex-group() is a pretty messy mechanism, and it
would be nice to do better.
Things get a bit more complicated if there are groups that can match
more than once, of the form (...)*. It's not clear how that would work
with your proposed syntax.
One of the constraints is that we want to ensure that the facilities
can be implemented on top of popular regex libraries such as those
used by Java, C#, or Perl. These are all very heavily based on the
concept of numbered captured groups, with all their quirks.
I'm not sure I understand. How would having (...)* affect the
syntax or semantics in any way?
The straightforward implementation of this that I imagine is a simple
rewrite. If we have matching-substring instructions for regexes rx1,
rx2, ..., rxN, the implementation rewrites them as a single regex:
(rx1)|(rx2)|...|(rxN)
and then counts the parentheses to determine the group number of each
of the original tokens. From there it's a trivial rewrite to
choose/when form. Counting parentheses is sufficient per the spec for
the regex-group() function:
"The Nth captured substring (where N > 0) is the string matched by the
subexpression contained by the Nth left parenthesis in the regex."
So any quantifiers on groups shouldn't affect this. It would, of
course, also have to correct the group number for any user call to
regex-group() from within matching-substring, but that is similarly
trivial.
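The rewrite described above can be sketched outside XSLT. The following is a minimal illustration in Python, not a description of how any XSLT processor works: the function names `combine` and `analyze` are hypothetical, Python's `re` module stands in for the XPath regex flavor, and `re.compile(rx).groups` is used as a shortcut for counting left parentheses. The sketch does not renumber back-references inside the individual patterns, which a real rewrite would also have to handle.

```python
import re

def combine(patterns):
    """Join patterns as (rx1)|(rx2)|...|(rxN) and record where each one starts."""
    parts = []
    outer = []        # combined-regex group number of each pattern's wrapper group
    next_group = 1
    for rx in patterns:
        outer.append(next_group)
        parts.append("(" + rx + ")")
        # One wrapper group plus the pattern's own capturing groups.
        next_group += 1 + re.compile(rx).groups
    return re.compile("|".join(parts)), outer

def analyze(patterns, text):
    """Yield (pattern_index, matched_text, local_groups) for each match in text."""
    combined, outer = combine(patterns)
    for m in combined.finditer(text):
        for i, base in enumerate(outer):
            if m.group(base) is not None:
                n = re.compile(patterns[i]).groups
                # Local group k of pattern i is combined group base + k.
                local = [m.group(base + k) for k in range(1, n + 1)]
                yield i, m.group(base), local
                break

# Hypothetical token patterns, including the nested optional group from the reply.
patterns = [r"\d+(\.\d*)?", r"[a-z]+", r"(\+)|(-)"]
result = list(analyze(patterns, "3.14+x-2"))
```

Here the wrapper group plays the role of the choose/when test, and the `local` list is what a corrected regex-group(k) call would see inside the corresponding matching-substring.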
By the way, as a side question: what is regex-group() supposed to
return in XSLT 2.0 at present when the corresponding subexpression
matches more than once, as it may in the (...)* case?
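As a point of comparison only, here is how one numbered-group library behaves for a repeated group. This uses Python's `re` module, which is an assumption of this illustration: it is not one of the libraries named in the thread, and it says nothing normative about XSLT 2.0.

```python
import re

# Group 1 sits inside a repetition and matches twice ("a", then "b").
m = re.fullmatch(r"(a|b)*c", "abc")
last = m.group(1)      # only one value is retained for group 1

# When the repeated group matches zero times, the group is unset.
m0 = re.fullmatch(r"(a|b)*c", "c")
unset = m0.group(1)
```

In this library the group keeps the substring from its final iteration, and an iteration count of zero leaves it unset.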
I suggest you post this to the W3C Bugzilla database as a comment on
the spec, which means it will go on the WG agenda for consideration.
The status section of the spec gives you a pointer.
I wanted to discuss it here first to see if there are any
obvious design flaws that I've missed, or other relevant
scenarios that others have encountered. The idea is to submit
this as a spec comment in the end, yes.
I think that in many real-life cases one can solve this problem by
doing two levels of matching. For example, you can often do it by
first tokenizing with space as a delimiter, then matching each token
against specific regex patterns. This avoids the reliance on captured
subgroups.
In my specific case, I was trying to use the facility to parse XPath
1.0 expressions, so tokenizing on space isn't an option there. Of
course, one can first tokenize using analyze-string, and then use
matches() on each token separately, but this is still rather
inconvenient, as well as a needless performance hit because of double
matching.
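The two-level approach discussed above (tokenize first, then match each token against specific patterns) can be sketched as follows. This is an illustration in Python under stated assumptions: the token classes are hypothetical and heavily simplified, not a real XPath 1.0 lexer, and `classify` and `tokenize` are made-up names. The non-capturing `(?:...)` groups are a Python convenience.

```python
import re

# Hypothetical, simplified token classes (not the XPath 1.0 lexical grammar).
TOKEN_CLASSES = [
    ("number", r"\d+(\.\d*)?|\.\d+"),
    ("name", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("operator", r"//|/|\||\+|-|!=|<=|>=|=|<|>"),
]

# Pass 1: a single alternation used only to split the input into tokens.
SPLITTER = re.compile("|".join("(?:" + rx + ")" for _, rx in TOKEN_CLASSES))

def classify(token):
    # Pass 2: match the token again, against each class in turn.
    for name, rx in TOKEN_CLASSES:
        if re.fullmatch(rx, token):
            return name
    return "unknown"

def tokenize(text):
    return [(classify(m.group()), m.group()) for m in SPLITTER.finditer(text)]

tokens = tokenize("a/b = 3.5")
```

Every token is matched once by the splitter and then again in `classify`, which is the double matching the message objects to.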
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail:
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--