xsl-list

Re: [xsl] How to split an RegEx into several lines for readability?

2007-05-01 09:59:43
Dimitre Novatchev wrote:
> As I am an absolute RegEx beginner, please excuse me if this is a
> trivial question.

A good thing to know about regexes is that, besides being powerful, they can also be very dangerous, especially to the unaware: backtracking can make a regex run in exponential time on non-matching strings. An example of such a regex is in this post: http://www.nabble.com/Certain-non-zero-length-non-matching-regexes-run-forever-on-Saxon-tf3065127.html#a8524868

If you are going to use regexes in a production environment, make sure to test them thoroughly for this behavior, or your processor may occasionally hang.
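
To make the risk concrete, here is a minimal sketch. It is not the regex from the linked post: the nested quantifier in (a+)+ is simply a textbook pattern that can push a backtracking engine into exponential time on input that almost matches. The regex, the test string and the template name "main" are assumptions chosen for illustration, and whether the call really stalls depends on the regex engine behind your processor.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical test input: 30 'a' characters followed by one
       character that makes the overall match fail. -->
  <xsl:variable name="almost"
      select="concat(string-join(for $i in 1 to 30 return 'a', ''), '!')"/>

  <xsl:template name="main">
    <!-- Nested quantifiers: on a backtracking engine every extra 'a'
         can roughly double the number of paths tried before failing. -->
    <xsl:value-of select="matches($almost, '^(a+)+$')"/>
  </xsl:template>

</xsl:stylesheet>
```

Run it with "main" as the initial template. If the call returns, the result is false, because the string does not match. Start with a smaller count than 30 and raise it while timing the run; a steep jump in run time for each added character is the warning sign.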



> Is there any way I can split this RegEx on separate lines and/or add
> whitespace so that it would be more readable?

You have already heard of the 'x' modifier, but there are a few things you should know before splitting your regex into a more readable format:

* If you use Saxon, several bugs concerning whitespace handling have been fixed in the 8.8 and 8.9 releases, some of which you may consider significant, like this one: http://www.nabble.com/Bug%3A-whitespace-at-beginning-of-regex-fails-the-regex-when-in-%27x%27-%28ignore-whitespace%29-mode-tf2870226.html#a8022584

* "Ignore whitespace" is meant very literally. In XSLT regexes, fn:matches("hello world", "hello\ sworld", "x") returns true: with the whitespace removed, the "\ s" part of the regex becomes "\s", which matches the space. Most other regex engines (Perl, for one) treat an escaped space as a literal space instead.

* The only place where you must be aware of whitespace with 'x' on is inside character classes, where it is not ignored: [abc ] matches 'a', 'b', 'c' or ' '.

* You probably don't want to do this, but this is allowed with the 'x' modifier: "\p{ I s B a s i c L a t i n }+" and is the same as "\p{IsBasicLatin}+".
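
The three whitespace rules above can be checked with a small stylesheet. This is only a sketch: the test strings, the added ^ and $ anchors and the template name "main" are my own choices, not part of the original question.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template name="main">
    <!-- "\ s" collapses to "\s" once whitespace is removed,
         so it matches the space between the two words. -->
    <xsl:value-of select="matches('hello world', 'hello\ sworld', 'x')"/>
    <xsl:text>&#10;</xsl:text>

    <!-- The space inside the character class is kept,
         so a string consisting of one space matches. -->
    <xsl:value-of select="matches(' ', '^[abc ]$', 'x')"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Whitespace inside \p{...} is removed like anywhere else. -->
    <xsl:value-of select="matches('abc', '^\p{ I s B a s i c L a t i n }+$', 'x')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Going by the behaviour described above, each call should report true on a conforming processor. If one of them does not, check your processor version first, given the Saxon whitespace bugs mentioned in the first point.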


And a tip for making your regexes more readable: introduce comments inside them. Other regex languages let you do that within the regex syntax itself, but XSLT regexes do not. You can easily work around this by putting your regexes inside a variable, annotating them with XML comments, and always calling them with the 'x' modifier:

<xsl:variable name="myregex" as="xs:string">
   (          <!-- grab everything -->
   "          <!-- start of a q. string -->
   [^"]*      <!-- zero or more non-quotes -->
   "          <!-- end of a q. string -->
   )          <!-- closing 'grab all' -->
</xsl:variable>
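
For completeness, here is one way such a variable might be used, as a minimal sketch. The sample input, the element names quoted-strings and q, and the template name "main" are made up for the example. The XML comments are dropped when the stylesheet is read, and the 'x' flag then removes the remaining whitespace from the regex.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="xml" indent="yes"/>

  <!-- The commented regex from above. -->
  <xsl:variable name="myregex" as="xs:string">
     (          <!-- grab everything -->
     "          <!-- start of a q. string -->
     [^"]*      <!-- zero or more non-quotes -->
     "          <!-- end of a q. string -->
     )          <!-- closing 'grab all' -->
  </xsl:variable>

  <!-- Hypothetical sample input. -->
  <xsl:variable name="input" as="xs:string">say "hello" and "goodbye"</xsl:variable>

  <xsl:template name="main">
    <quoted-strings>
      <!-- regex is an attribute value template, so {$myregex}
           inserts the string value of the variable. -->
      <xsl:analyze-string select="$input" regex="{$myregex}" flags="x">
        <xsl:matching-substring>
          <q><xsl:value-of select="regex-group(1)"/></q>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </quoted-strings>
  </xsl:template>

</xsl:stylesheet>
```

With this input I would expect two q elements, one per quoted string, each including its surrounding quotes. The same variable works with the functions too, for instance matches($input, $myregex, 'x').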


I use this method to some extent in a format that allows recursive and repetitive regexes on input: you supply a 'parser' written in XSLT with a set of regexes stored in XML, and these are then applied to the input. If you have many regexes, you will find them easier to maintain if you collect them in a library and reuse them.
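
The following is not that parser, only a much simplified sketch of the general idea of keeping regexes as XML data and applying them in a loop. Every name in it (regex-library, regex, tokens, token, the sample input) is hypothetical, and it does nothing recursive.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical library: one element per named, commented regex. -->
  <xsl:variable name="regex-library">
    <regex name="quoted">
      " [^"]* "       <!-- a double-quoted string -->
    </regex>
    <regex name="number">
      [0-9]+          <!-- integer part -->
      ( \. [0-9]+ )?  <!-- optional fraction -->
    </regex>
  </xsl:variable>

  <!-- Hypothetical sample input. -->
  <xsl:variable name="input" as="xs:string">width "wide" 12.5</xsl:variable>

  <xsl:template name="main">
    <tokens>
      <!-- Apply every regex in the library to the same input. -->
      <xsl:for-each select="$regex-library/regex">
        <xsl:variable name="name" select="@name"/>
        <!-- {.} is the text of the current regex element. -->
        <xsl:analyze-string select="$input" regex="{.}" flags="x">
          <xsl:matching-substring>
            <token type="{$name}"><xsl:value-of select="."/></token>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:for-each>
    </tokens>
  </xsl:template>

</xsl:stylesheet>
```

Keeping the regexes as data means a new pattern is one more regex element rather than another template, and the library could just as well live in a separate document loaded with document().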

Cheers,
-- Abel Braaksma

