RE: [xsl] A better xsl:analyze-string

I like the avoidance of the clumsy numbered capture groups (my non-starting
proposal would be to add Perl capture variables a la $1, $2, $3, $&, $', $`, etc.).

But how would you retrieve the value of the matching subgroup (the decimal
portion) in:
<xsl:matching-substring regex="\d+(\.\d*)?">
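For concreteness, this is what the numbered-group approach gives for that
regex. It is only a sketch in Python's re module, used as a stand-in for the
regex libraries under discussion; the helper name decimal_part is made up
for illustration and is not part of any proposal.

```python
import re

# The regex from the question: digits, then an optional decimal part in group 1.
NUMBER = re.compile(r"\d+(\.\d*)?")

def decimal_part(text):
    """Return what group 1 captured, "" if the group did not participate,
    or None if the text is not a number at all."""
    m = NUMBER.fullmatch(text)
    if m is None:
        return None
    # group(1) is None when the optional (...)? part matched nothing.
    return m.group(1) or ""

print(decimal_part("3.14"))
print(decimal_part("42"))
```

Any replacement for numbered groups would need an equally direct way to reach
that inner group.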

There's something asymmetric about your proposal that bothers me.  There are
other cases of combining multiple capture groups that wouldn't get the same
special treatment (like the nesting in the example).  Why assume that
capture groups are always combined as (...)|(...)?  It's a very special
case: is it so common as to warrant special syntax?

-Mike

-----Original Message-----
From: Pavel Minaev [mailto:int19h(_at_)gmail(_dot_)com] 
Sent: Thursday, August 20, 2009 5:59 PM
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] A better xsl:analyze-string

On Thu, Aug 20, 2009 at 2:39 PM, Michael Kay<mike(_at_)saxonica(_dot_)com> wrote:

It's true that using regex-group() is a pretty messy mechanism, and it
would be nice to do better.

Things get a bit more complicated if there are groups that can match
more than once, of the form (...)*. It's not clear how that would work
with your proposed syntax.

One of the constraints is that we want to ensure that the facilities
can be implemented on top of popular regex libraries such as those
used by Java, C#, or Perl. These are all very heavily based on the
concept of numbered captured groups, with all their quirks.


I'm not sure I understand. How would having (...)* affect the
syntax or semantics in any way?

The straightforward implementation of this that I imagine is
a simple rewrite. If we have matching-substring instructions
for regexes rx1, rx2, ..., rxN, the implementation rewrites
them as a single regex:

  (rx1)|(rx2)|...|(rxN)

and then counts the parentheses to determine the group number
of each of the original tokens. From there it's a trivial
rewrite to choose/when form. Counting parentheses is
sufficient per the spec for the regex-group() function:

"The Nth captured substring (where N > 0) is the string
matched by the subexpression contained by the Nth left
parenthesis in the regex."

So any quantifiers on groups shouldn't affect this. It would,
of course, also have to correct the group number for any user
call to regex-group() from within matching-substring, but that
is similarly trivial.
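
A minimal sketch of that rewrite, in Python rather than XSLT, assuming
Python's re syntax is close enough to make the point. The function names
combine and tokenize are hypothetical. Instead of counting "(" characters by
hand, it asks the library how many groups each regex contains, which avoids
miscounting escaped parentheses such as \( or [(].

```python
import re

def combine(regexes):
    """Build (rx1)|(rx2)|...|(rxN) and record the group number of each wrapper."""
    starts = []
    next_group = 1
    for rx in regexes:
        starts.append(next_group)
        # One wrapper group plus however many groups rx itself contains.
        next_group += 1 + re.compile(rx).groups
    combined = "|".join("(" + rx + ")" for rx in regexes)
    return re.compile(combined), starts

def tokenize(regexes, text):
    """Yield (index of the regex that matched, matched text, group lookup)."""
    pattern, starts = combine(regexes)
    for m in pattern.finditer(text):
        for i, start in enumerate(starts):
            if m.group(start) is not None:
                # Renumbering: the user's group n sits at start + n in the
                # combined regex, so a regex-group(n) call maps to that slot.
                yield i, m.group(start), (lambda n, m=m, s=start: m.group(s + n))
                break

rules = [r"\d+(\.\d*)?", r"[a-z]+", r"(\+)|(-)"]
for index, token, group in tokenize(rules, "3.14 abc -"):
    print(index, token, group(1))
```

Nested groups and quantified groups need no special handling here, since
only the offsets change. Back-references inside the original regexes would
need the same renumbering, which this sketch does not attempt.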

By the way, as a side question - what is regex-group() 
supposed to return in XSLT 2.0 at present when the 
corresponding subexpression matches more than once - as it 
may do in (...)* case?
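
For comparison, this is what one of the numbered-group libraries does with a
repeated group. The snippet shows Python's re only; it says nothing about
what the XSLT 2.0 spec requires.

```python
import re

# A group under a quantifier is matched once per iteration...
m = re.fullmatch(r"(a|b)*", "abab")

# ...but it still has only one numbered slot, and in Python's re that slot
# holds the text from the final iteration.
last_iteration = m.group(1)
print(last_iteration)
```

Java's documentation describes the same rule: a group's captured input is the
subsequence it most recently matched.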

I suggest you post this to the W3C bugzilla database as a comment on
the spec, which means it will go on the WG agenda for consideration.
The status section of the spec gives you a pointer.


I wanted to discuss it here first to see if there are any 
obvious design flaws that I've missed, or other relevant 
scenarios that others have encountered. The idea is to submit 
this as a spec comment in the end, yes.

I think that in many real-life cases one can solve this problem by
doing two levels of matching. For example, you can often do it by
first tokenizing with space as a delimiter, then matching each token
against specific regex patterns. This avoids the reliance on captured
subgroups.

In my specific case, I was trying to use the facility to
parse XPath 1.0 expressions, so tokenizing on space isn't an
option there. Of course, one can first tokenize using
analyze-string, and then use matches() on each token separately,
but this is still rather inconvenient, as well as a needless
performance hit because of double matching.
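
The double matching looks roughly like this. It is a Python sketch standing
in for analyze-string followed by matches(); the three token classes are a
made-up fragment for illustration, not a real XPath 1.0 lexer.

```python
import re

# Hypothetical token classes for a tiny XPath-like fragment.
TOKEN_CLASSES = [
    ("number", r"\d+(?:\.\d*)?"),
    ("name", r"[A-Za-z_][A-Za-z0-9_-]*"),
    ("operator", r"//|/|\[|\]|=|@"),
]

# Pass 1: one combined regex finds the tokens without classifying them.
ANY_TOKEN = re.compile("|".join(rx for _, rx in TOKEN_CLASSES))

def classify_twice(text):
    """Tokenize, then match every token again to learn which class it is."""
    result = []
    for token in ANY_TOKEN.findall(text):
        for name, rx in TOKEN_CLASSES:
            if re.fullmatch(rx, token):  # Pass 2: the redundant second match.
                result.append((name, token))
                break
    return result

for kind, token in classify_twice("//item[@id=42]"):
    print(kind, token)
```

Every token is matched once to find it and then up to three more times to
classify it, which is the cost the proposed syntax is meant to remove.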

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


