xsl-list
[Top] [All Lists]

Re: [xsl] check the type of the $pattern argument to a regular expression?

2007-04-16 16:38:37
Dimitre Novatchev wrote:
That it can't be
done using a regular expression is one thing, that it can't be done
with XSL-T 2.0 is incorrect, given that all one would need to check
the validity of regular expressions is a turing complete language.


In fact,  have a general purpose LR(1) parser written cmpletely in
XSLT 2.0, which will parse any input and return the sequence of
poduction used in the parse.

Provide any LR-1 grammar written in BNF and you'll get the checker.


I am unfamiliar with the term LR-1, and likely this is not what it should be, it is just a poor man's attempt to solve the OP's original inquiry: checking the validity of a regular expression. Just by glimpsing over my own solution, I see plenty of parts for improvement. But I didn't want to make it a week-long assignment, so consider it a start for testing the most common mistakes in your (Bryan's) regular expressions.

As a guide, I used the W3 XML Schema specification and the XSLT 2.0 / XPath extensions. The solution the is laid out below supports the following:

 * Correct use of escapes
 * Correct use of parentheses and escaped parenteses
* Escaped meta characters and correct pairs of unescaped ones (i.e., {..} and [...])
 * Fully supports character classes, except for the unicode categories
 * Supports backreferences, but does not check for validity of them
* Supports 'matches', but not 'replace/tokenize' (i.e., empty-matching is not considered wrong) * Supports all sorts of greedy/reluctant quantifiers and errors if specified incorrectly

There's more to it, but I leave that as an exercise to the reader. In a nutshell, if you want to apply it, be aware that it checks the validity of the _syntax_, not the overall validity of the regular expression. That means that ' [z-a] ' validates correct, and ' (.)\2 ' will validate correct as well, but neither will compile.

Run the stylesheet against anything, or with IT "main", and it will give you the following example output:

This valid regex ' (a(\(b(c+?.d)))+ ' is asserted as:  valid
This valid regex ' ((((((.).).)+?))+) ' is asserted as:  valid
This valid regex ' a{1,12}b\{c?? ' is asserted as:  valid
This valid regex ' ([^a-z123\n]+?)\1 ' is asserted as:  valid
This invalid regex ' (a(b(c)) ' is asserted as:  invalid
This invalid regex ' [abe\l] ' is asserted as:  invalid
This invalid regex ' (a|b|c|d)\\\ ' is asserted as:  invalid
This invalid regex ' .++ ' is asserted as:  invalid


All in all, the build-up of the regex to check the regex is quite straight-forward. The only trouble is in the recursive part (which could do with some better writing). Call it as
re:is-regex(....).

Have fun with it. If you plan to add the category escapes to the list of supported features, let me know, please ;)

Cheers,
-- Abel Braaksma

PS: it _can_ be done in regular expressions alone, but then you would loose some of the lexical analysis (paired grouping parentheses, mainly).


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
   xmlns:xs = "http://www.w3.org/2001/XMLSchema";
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform";
   xmlns:re = "http://regex";
   version="2.0">
<xsl:output method="text" /> <xsl:variable name="test-re">
       <re assert="valid">(a(\(b(c+?.d)))+</re>
       <re assert="valid">((((((.).).)+?))+)</re>
       <re assert="valid">a{1,12}b\{c??</re>
       <re assert="valid">([^a-z123\n]+?)\1</re>
       <re assert="invalid">(a(b(c))</re>
       <re assert="invalid">[abe\l]</re>
       <re assert="invalid">(a|b|c|d)\\\</re>
       <re assert="invalid">.++</re>
   </xsl:variable>
<xsl:template match="/" name="main">
       <xsl:apply-templates select="$test-re/*" />
   </xsl:template>
<xsl:template match="re">
       <xsl:value-of select="
           'This', @assert, 'regex ''',
           ., ''' is asserted as: ',
           ('invalid', 'valid')[1 + xs:integer(re:is-regex(current()))],
           '&#xA;'" />
   </xsl:template>
<!--
       zero-or-one, zero-or-more, one-or-more etc
       (global because we need it in two functions)
    -->
   <xsl:variable name="quantifier" as="xs:string">
       (
       [?*+]                 <!-- q. mark, asterisk, plus sign -->
       |                     <!-- or -->
       \{[0-9]+,?[0-9]*\}    <!-- {.. , ..} quantifier -->
       )\??                  <!-- optional reluctant quantifier -->
   </xsl:variable>
<!--
       Test a regex. It is first 'normalized' by removing any
       whitespace. This may inadvertently make a valid regex
       invalid if an escaped whitespace is part of the regex.
-->
   <xsl:function name="re:test-partial-regex" as="xs:boolean">
       <xsl:param name="re" as="xs:string" />
<!-- a single char, not a meta char -->
       <xsl:variable name="char">
           [^.\\?*+{}()|^$\[\]]
       </xsl:variable>
<xsl:variable name="single-char-escape">
           \\[nrt\\|.?*+(){}^\[\]$\-]
       </xsl:variable>
<xsl:variable name="multi-char-escape">
           (\\[sSiIcCdDwW] | \. | \\[0-9]+)
       </xsl:variable>
<xsl:variable name="char-class-escape">
           <xsl:text>(</xsl:text>
           <xsl:sequence select="$single-char-escape" />
           <xsl:text>|</xsl:text>
           <xsl:sequence select="$multi-char-escape" />
           <xsl:text>)</xsl:text>
       </xsl:variable>
<!-- a single meta char -->
       <xsl:variable name="meta-char">
           [nrt\\|.?*+(){}^\[\]$\sSiIcCdDwW-]
       </xsl:variable>
<!-- a char that is allowed inside a character class expression -->
       <xsl:variable name="xml-char">
           [^\\\[\]\-]
       </xsl:variable>
<!-- base 'char' of a character class expression -->
       <xsl:variable name="char-or-esc">
           <xsl:text>(</xsl:text>
           <xsl:sequence select="$xml-char" />
           <xsl:text>|</xsl:text>
           <xsl:sequence select="$single-char-escape" />
           <xsl:text>)</xsl:text>
       </xsl:variable>
<!-- range in a character class expression -->
       <xsl:variable name="char-range">
           <xsl:text></xsl:text>
           <xsl:sequence select="$char-or-esc" />
           <xsl:text>-</xsl:text>
           <xsl:sequence select="$char-or-esc" />
       </xsl:variable>
<!-- a character class expression -->
       <!-- all that can be between [ and ] -->
       <xsl:variable name="char-class-expr">
           \[
           (\^|-)?
           <xsl:text>(</xsl:text>
           <xsl:sequence select="$char-or-esc" />
           <xsl:text>|</xsl:text>
<xsl:sequence select="$char-range" /> <xsl:text>)+</xsl:text>
           \]
       </xsl:variable>
<!-- a character class is an escaped meta char, a char class expr or a wildcard expr -->
       <xsl:variable name="char-class">
           <xsl:text>(</xsl:text>
           <xsl:sequence select="$char-class-escape" />
           <xsl:text>|</xsl:text>
           <xsl:sequence select="$char-class-expr" />
           <xsl:text>)</xsl:text>
       </xsl:variable>
<!-- a char or an escaped meta char -->
       <xsl:variable name="atom" as="xs:string+">
           <xsl:text>(</xsl:text>
           <xsl:sequence select="$char" />
           <xsl:text>|</xsl:text>
           <xsl:sequence select="$char-class" />
           <xsl:text>)</xsl:text>
       </xsl:variable>
<!-- a piece has an atom and optional quantifier -->
       <xsl:variable name="piece">
           <xsl:text>(</xsl:text>
           <xsl:sequence select="$atom" />
           <xsl:text>)</xsl:text>
           <xsl:text>(</xsl:text>
           <xsl:sequence select="$quantifier" />
           <xsl:text>)?</xsl:text>
       </xsl:variable>
<!-- a branch has zero or more pieces
       (here: 1 or more, for optimizing the matching) -->
       <xsl:variable name="branch">
           <xsl:text>(</xsl:text>
           <xsl:sequence select="$piece" />
<xsl:text>)+</xsl:text> </xsl:variable>

       <!-- a regexp has one or more branches -->
       <xsl:variable name="regexp" >
<xsl:text>^$|^</xsl:text> <xsl:sequence select="$branch" />
           <xsl:text>(\|</xsl:text>
           <xsl:sequence select="$branch" />
           <xsl:text>)*</xsl:text>
<xsl:text>$</xsl:text> </xsl:variable>
       <xsl:sequence select="matches($re, $regexp, 'x')"  />
   </xsl:function>
<xsl:function name="re:is-regex" as="xs:boolean">
       <xsl:param name="re" as="xs:string" />
<!--
           variable with xs:boolean and xs:string results, the first
           contains results of inner regexes (inside parentheses) the
           second the remaining parenthesized string, which will
           recursively be re-applied
        -->
       <xsl:variable name="result-seq" as="xs:anyAtomicType*">
           <!-- remove the escaped parentheses before applying regex -->
<xsl:analyze-string select="replace($re, '\\\(|\\\)', '')" regex="\(([^()]*)\)({$quantifier})?" flags="x">
               <xsl:matching-substring>
                   <!-- test this deepest level -->
<xsl:sequence select="re:test-partial-regex(regex-group(1))" />
               </xsl:matching-substring>
               <xsl:non-matching-substring>
                   <xsl:sequence select="xs:string(.)" />
               </xsl:non-matching-substring>
           </xsl:analyze-string>
       </xsl:variable>
       <xsl:sequence select="
           (: if a non-regex partial match is found, exit early :)
if(some $b in $result-seq satisfies $b instance of xs:boolean and xs:boolean($b) = false())
           then false()
           else
               (: last string, no parentheses anymore :)
if(count($result-seq) = 1 and $result-seq[1] instance of xs:string)
               then re:test-partial-regex($result-seq[1])
               else
(: more sub regexes available still, recursively apply :)
                   if($result-seq[. instance of xs:string][1])
then re:is-regex(string-join($result-seq[not(. instance of xs:boolean)], '')) else every $b in $result-seq satisfies xs:boolean($b) = true()" />
   </xsl:function>
</xsl:stylesheet>



--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--