Re: [xsl] xsl:analyze-string and multiple matching groups

2012-02-01 09:17:14
The way I tackle most regex problems is to break them up. I would do this one by first tokenizing on ';' and then tokenizing each token on ':'.
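A minimal sketch of that two-pass approach, assuming XSLT 2.0 and the sample string from the question, splitting first on ';' and then on the ':' separator that string uses. The parameter name "input", the named template "main" and the output elements "pairs" and "pair" are illustrative choices only:

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- Hypothetical parameter holding the string to parse. -->
   <xsl:param name="input" select="'a:b;c:d;'"/>

   <xsl:template name="main">
      <pairs>
         <!-- First pass: split on ';'.  The trailing ';' produces an
              empty last token, which the predicate filters out. -->
         <xsl:for-each select="tokenize($input, ';')[. ne '']">
            <!-- Second pass: split the current token on ':'. -->
            <xsl:variable name="parts" select="tokenize(., ':')"/>
            <pair key="{ $parts[1] }" value="{ $parts[2] }"/>
         </xsl:for-each>
      </pairs>
   </xsl:template>

</xsl:stylesheet>
```

With Saxon this could be run from the named template (for instance with -it:main). Each "x:y;" part becomes one pair element, however many parts the string contains.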

We've inherited from the Perl regex tradition the idea that each capturing subexpression (i.e. each occurrence of "(...)" in the source) captures zero or one substrings, so a group inside a repetition retains only the substring from its last iteration. It could have been done better (and perhaps we should have tried with the new fn:analyze-string() in 3.0, which generates a marked-up version of the input string), but everyone is scared of doing things that are beyond the scope of existing regex libraries.

Personally, having just integrated a modified version of Jakarta into Saxon, I'm less scared of this than I was. But that's the way xsl:analyze-string is defined, and it's defined that way for reasons of compatibility with the Perl regex tradition.
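Given that definition, one possible way to stay with xsl:analyze-string is to drop the anchors and the outer repetition, so that the instruction itself iterates over each "x:y;" part and the group numbers stay fixed at 1 and 2. This is only a sketch under that assumption; the template name "main" and the "pairs" and "pair" elements are illustrative, and reporting leftover text via xsl:message is one choice among several:

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <xsl:template name="main">
      <pairs>
         <!-- The regex describes a single "x:y;" unit.  The instruction
              evaluates xsl:matching-substring once per match. -->
         <xsl:analyze-string select="'a:b;c:d;'"
                             regex="([a-z]):([a-z]);">
            <xsl:matching-substring>
               <pair key="{ regex-group(1) }" value="{ regex-group(2) }"/>
            </xsl:matching-substring>
            <!-- Anything between matches is input the unit regex
                 does not cover; report it rather than drop it. -->
            <xsl:non-matching-substring>
               <xsl:message>Unexpected input: <xsl:value-of select="."/></xsl:message>
            </xsl:non-matching-substring>
         </xsl:analyze-string>
      </pairs>
   </xsl:template>

</xsl:stylesheet>
```

Unlike the anchored regex in the question, this does not validate the string as a whole in one step; the non-matching branch is what flags text that does not fit the pattern.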

Michael Kay
Saxonica

On 01/02/2012 14:42, Florent Georges wrote:
   Hi,

   Let's say I have a string of the form "a:b;c:d;" where there
can be any number of sub-parts of the form "x:y;", that I'd like
to parse using xsl:analyze-string.  With the following regex:

     ^(([a-z]):([a-z]);)+$

which matches indeed, I cannot use the regex-groups to retrieve
all values.  For instance the following:

     <xsl:analyze-string select="'a:b;c:d;'"
                         regex="^(([a-z]):([a-z]);)+$">
        <xsl:matching-substring>
           <group num="0" value="{ regex-group(0) }"/>
           <group num="1" value="{ regex-group(1) }"/>
           <group num="2" value="{ regex-group(2) }"/>
           <group num="3" value="{ regex-group(3) }"/>
           <group num="4" value="{ regex-group(4) }"/>
           <group num="5" value="{ regex-group(5) }"/>
           <group num="6" value="{ regex-group(6) }"/>
           <group num="7" value="{ regex-group(7) }"/>
        </xsl:matching-substring>
     </xsl:analyze-string>

returns the following:

     <group num="0" value="a:b;c:d;"/>
     <group num="1" value="c:d;"/>
     <group num="2" value="c"/>
     <group num="3" value="d"/>
     <group num="4" value=""/>
     <group num="5" value=""/>
     <group num="6" value=""/>
     <group num="7" value=""/>

when I would have expected the following instead:

     <group num="0" value="a:b;c:d;"/>
     <group num="1" value="a:b;"/>
     <group num="2" value="a"/>
     <group num="3" value="b"/>
     <group num="4" value="c:d;"/>
     <group num="5" value="c"/>
     <group num="6" value="d"/>
     <group num="7" value=""/>

   That is, I expected the regex-groups to match the "dynamic"
number of groups, instead of the strict "static" or "lexical"
group numbering from the regex string.  I thought that was what
I was used to in Perl and other tools, but I can't recall for
sure, and I didn't find a definitive answer in the spec.

   Are my expectations wrong?  If yes why?  And if yes, is there
any general solution to this problem? (by "general", I mean not
recursing on the string and using substring on ';' because here
this is a simple delimiter)

   BTW, tested with Saxon HE 9.3.0.5 and 9.4.0.2.

   Regards,

--
Florent Georges
http://fgeorges.org/
http://h2oconsulting.be/

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail:<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--




