RE: [xsl] A better xsl:analyze-string

2009-08-20 17:40:11
It's true that using regex-group() is a pretty messy mechanism, and it would
be nice to do better.

Things get a bit more complicated if there are groups that can match more
than once, of the form (...)*. It's not clear how that would work with your
proposed syntax.

One of the constraints is that we want to ensure that the facilities can be
implemented on top of popular regex libraries such as those used by Java,
C#, or Perl. These are all very heavily based on the concept of numbered
captured groups, with all their quirks.

I suggest you post this to the W3C bugzilla database as a comment on the
spec, which means it will go on the WG agenda for consideration. The status
section of the spec gives you a pointer.

I think that in many real-life cases one can solve this problem by doing two
levels of matching. For example, you can often do it by first tokenizing
with space as a delimiter, then matching each token against specific regex
patterns. This avoids the reliance on captured subgroups.
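A minimal sketch of the two-level approach, assuming the tokens are separated by whitespace (the template name "main" and the output labels are illustrative choices, borrowed from the example in the original message):

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical entry point; run with a named initial template. -->
  <xsl:template name="main">
    <!-- Level 1: split on whitespace. The predicate drops the empty
         token that leading or trailing whitespace would produce. -->
    <xsl:for-each select="tokenize('abc 123 foo 456', '\s+')[. ne '']">
      <!-- Level 2: classify each token with an anchored pattern.
           Numbers are tested first because \w+ also matches digits. -->
      <xsl:choose>
        <xsl:when test="matches(., '^\d+(\.\d*)?$')">
          <xsl:text> number </xsl:text>
          <xsl:value-of select="."/>
        </xsl:when>
        <xsl:when test="matches(., '^\w+$')">
          <xsl:text> identifier </xsl:text>
          <xsl:value-of select="."/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:text> unknown </xsl:text>
          <xsl:value-of select="."/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

No regex-group() calls are needed, so adding a subgroup to one pattern does not disturb the others. The trade-off is that this only works when a simple delimiter separates the tokens.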

Regards,

Michael Kay
http://www.saxonica.com/
http://twitter.com/michaelhkay 


-----Original Message-----
From: Pavel Minaev [mailto:int19h(_at_)gmail(_dot_)com] 
Sent: 20 August 2009 18:51
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: [xsl] A better xsl:analyze-string

After some recent struggles with xsl:analyze-string, I would 
like to share my thoughts on its current design, and how it 
could be improved for specific scenarios.

On the surface, the construct seems to be very well suited 
for tokenizing plain text input - indeed, judging from its 
semantics of repeatedly applying a regex to the input string, 
this seems deliberate. However, it is very inconvenient to 
figure out _what_ actually matched once it does match. One 
either has to match the current substring one more time 
against the regex for each token in turn, or make each token a 
separate group in xsl:analyze-string/@regex, and see which of 
the groups is non-empty. Say I want to tokenize into numbers, 
identifiers, and the rest, ignoring whitespace. I would have 
to do something like this:

        <xsl:analyze-string select="'abc 123 foo 456'"
                            regex="(\s+)|(\d+(\.\d*)?)|(\w+)">
            <xsl:matching-substring>
                <xsl:choose>
                    <xsl:when test="regex-group(2) ne ''">
                        <xsl:text> number </xsl:text>
                        <xsl:value-of select="."/>
                    </xsl:when>
                    <xsl:when test="regex-group(4) ne ''">
                        <xsl:text> identifier </xsl:text>
                        <xsl:value-of select="."/>
                    </xsl:when>
                </xsl:choose>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
                <xsl:text> unknown </xsl:text>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>

This can get unwieldy really fast, because top-level regex 
groups for tokens will often contain subgroups - even in the 
simple example above this is already the case - and thus the 
indices of token groups are not sequential; and, of course, 
there are no named groups in XSLT regular expressions (which 
is something that might also come in handy).

I was wondering whether, for a case like this (which, I would 
imagine, is pretty common when parsing non-trivial non-XML 
data), it would be more convenient to let the instruction 
itself do the branching on tokens. Syntactically, it could 
look like this:

        <xsl:analyze-string select="'abc 123 foo 456'">
            <xsl:matching-substring regex="\s+"/>
            <xsl:matching-substring regex="\d+(\.\d*)?">
                <xsl:text> number </xsl:text>
                <xsl:value-of select="."/>
            </xsl:matching-substring>
            <xsl:matching-substring regex="\w+">
                <xsl:text> identifier </xsl:text>
                <xsl:value-of select="."/>
            </xsl:matching-substring>
            ...
            <xsl:non-matching-substring>
                <xsl:text> unknown </xsl:text>
                <xsl:value-of select="."/>
            </xsl:non-matching-substring>
        </xsl:analyze-string>

That is, an alternate form of xsl:analyze-string which 
doesn't have @regex, but which contains one or more 
xsl:matching-substring instructions that each have their own 
@regex. For every matched substring, the 
xsl:matching-substring instruction whose regex matched is 
used. Otherwise, the semantics are the same (context 
item/position/size, prohibition on regexes that can match 
empty strings, etc).

It has a fairly obvious direct translation to the existing 
syntax for xsl:analyze-string, so this really is just 
syntactic sugar, and thus would be easy to implement - in 
fact, it could be done entirely by an XSLT transform. At the 
same time, I believe that it makes a fairly important use 
case so much easier.

Your thoughts?

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


