xsl-list

Re: [xsl] How to split an RegEx into several lines for readability?

2007-05-01 21:29:34
Thank you Abel,

I already read about such notorious use of RegExes.

And a tip for making your regexes more readable: introduce comments
inside your regexes. In other regex languages you can do that inside the
regex language, but not with a regex in XSLT. You can easily fix this by
putting your regexes inside a variable and always calling them with the
'x' modifier:

<xsl:variable name="myregex" as="xs:string">
   (          <!-- grab everything -->
   "          <!-- start of a q. string -->
   [^"]*      <!-- zero or more non-quotes -->
   "          <!-- end of a q. string -->
   )          <!-- closing 'grab all' -->
</xsl:variable>



I think this is probably the most amazing and useful tip I got in this
thread -- fully deserves to be in the XSLT FAQ!

I have never before seen a string variable defined in this way --
probably because we do not have an "x" modifier (or any modifiers at all)
when generally defining variables.

Also, I don't think I have ever seen comments interspersed within
the string contents of an xsl:variable.

Very nice!

--
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
-------------------------------------
You've achieved success in your field when you don't know whether what
you're doing is work or play


On 5/1/07, Abel Braaksma <abel(_dot_)online(_at_)xs4all(_dot_)nl> wrote:
Dimitre Novatchev wrote:
> As I am an absolute RegEx beginner, please excuse me if this is a
> trivial question.

A good thing to know about regexes is that, besides being powerful, they
can also be very dangerous, especially to the unaware: backtracking can
make a regex take exponential time on non-matching strings. An
example of such a regex is in this post:
http://www.nabble.com/Certain-non-zero-length-non-matching-regexes-run-forever-on-Saxon-tf3065127.html#a8524868

If you are going to use regexes in a production environment make sure to
test them thoroughly for this behavior or your processor may hang
occasionally.
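
To make the backtracking risk concrete, here is a minimal sketch. It uses the
textbook nested-quantifier pattern, not the regex from the linked post, and
the template name "main" and the input string are made up for illustration.
Whether the first call is actually slow depends on the regex engine behind
your processor, so treat it as something to test rather than a guaranteed
result. The input is kept short on purpose.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <!-- Hypothetical input: a run of a's followed by '!', so neither regex can match -->
  <xsl:variable name="input" as="xs:string" select="'aaaaaaaaaaaaaaaaaaaa!'"/>

  <xsl:template name="main">
    <!-- Nested quantifiers: a backtracking engine may try every way of
         splitting the run of a's between the inner and the outer '+' -->
    <xsl:value-of select="matches($input, '^(a+)+$')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Accepts exactly the same strings, but leaves the engine
         nothing to re-split when the match fails -->
    <xsl:value-of select="matches($input, '^a+$')"/>
  </xsl:template>

</xsl:stylesheet>
```

Lengthening the run of a's by a few characters at a time and watching the
run time is a cheap way to spot this kind of regex before it reaches
production.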
>
> Is there any way I can split this RegEx on separate lines and/or add
> whitespace so that it would be more readable?

You already heard of the 'x' modifier, but there are a few things that
you should know before splitting your regex into a more readable format:

 * If you use Saxon, several bugs concerning whitespace handling have
been fixed in the 8.8 and 8.9 releases, some of which you may consider
significant, like this one, which is now fixed:
http://www.nabble.com/Bug%3A-whitespace-at-beginning-of-regex-fails-the-regex-when-in-%27x%27-%28ignore-whitespace%29-mode-tf2870226.html#a8022584

 * "Ignore whitespace" is meant very literally: the whitespace is removed
from the regex before it is interpreted. I.e., in XSLT regexes,
this:  fn:matches("hello world", "hello\ sworld", "x") returns true. The
"\ s" part in the regex is, with whitespace removed, "\s" and matches a
space. Most other regex engines (Perl for one) treat an escaped space as
a literal space instead.

  * The only place where you must be aware of whitespace with 'x' is
inside character classes, where it is not ignored: [abc ] matches 'a', 'b',
'c' or ' '.

  * You probably don't want to do this, but this is allowed with the
'x' modifier: "\p{ I s B a s i c L a t i n }+" and is the same as
"\p{IsBasicLatin}+".

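The whitespace rules in the list above can be tried out with a small
stylesheet like the following. It is only a sketch: the template name "main"
is made up, and given the Saxon bugs mentioned above, older releases may
behave differently. By the rules as described, each of the three calls
should return true.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template name="main">
    <!-- "\ s" collapses to "\s" once the whitespace is removed,
         so it matches the space between the two words -->
    <xsl:value-of select="matches('hello world', 'hello\ sworld', 'x')"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Whitespace inside a character class is kept,
         so the class can match a single space -->
    <xsl:value-of select="matches(' ', '[abc ]', 'x')"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Whitespace inside \p{...} is removed as well,
         leaving \p{IsBasicLatin}+ -->
    <xsl:value-of select="matches('abc', '\p{ I s B a s i c L a t i n }+', 'x')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```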

And a tip for making your regexes more readable: introduce comments
inside your regexes. In other regex languages you can do that inside the
regex language, but not with a regex in XSLT. You can easily fix this by
putting your regexes inside a variable and always calling them with the
'x' modifier:

<xsl:variable name="myregex" as="xs:string">
   (          <!-- grab everything -->
   "          <!-- start of a q. string -->
   [^"]*      <!-- zero or more non-quotes -->
   "          <!-- end of a q. string -->
   )          <!-- closing 'grab all' -->
</xsl:variable>
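
A minimal sketch of how such a variable might then be used, assuming an
XSLT 2.0 processor. The template name "main" and the sample input are made
up for illustration. The regex attribute of xsl:analyze-string is an
attribute value template, so the variable goes inside curly braces, and the
'x' flag removes the layout whitespace (the comments are never part of the
string in the first place).

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <xsl:variable name="myregex" as="xs:string">
     (          <!-- grab everything -->
     "          <!-- start of a q. string -->
     [^"]*      <!-- zero or more non-quotes -->
     "          <!-- end of a q. string -->
     )          <!-- closing 'grab all' -->
  </xsl:variable>

  <!-- Hypothetical entry point and sample input -->
  <xsl:template name="main">
    <xsl:variable name="input" as="xs:string"
                  select="'say &quot;hello&quot; twice'"/>

    <!-- As a function argument the variable is passed directly -->
    <xsl:if test="matches($input, $myregex, 'x')">
      <!-- In the regex attribute it has to go through an AVT -->
      <xsl:analyze-string select="$input" regex="{$myregex}" flags="x">
        <xsl:matching-substring>
          <!-- Group 1 is the quoted string, including its quotes -->
          <xsl:value-of select="regex-group(1)"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

Forgetting the 'x' flag in either place leaves the spaces and line breaks
in the regex as literal characters to be matched, which is why the flag
should always accompany a variable written this way.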


I use this method to some extent in a format that allows recursive and
repetitive regexes on input by just supplying a 'parser' written in XSLT
with a set of regexes placed in XML that are then applied to the input.
If you have many regexes, you will find that they are easier to maintain
if you collect them in a library and reuse them.
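
The idea of keeping a set of commented regexes in XML and applying them to
the input could look roughly like the sketch below. This is not the actual
format described above, which is not shown in this thread: the rule element,
its name attribute, the input parameter and the template name "main" are all
invented for illustration, and the rules are applied once each, without any
recursion.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <!-- Hypothetical regex library; it could equally live in a separate
       XML file (with the regexes left uncommented, or the comments skipped) -->
  <xsl:variable name="rules">
    <rule name="quoted-string">
      "          <!-- opening quote -->
      [^"]*      <!-- zero or more non-quotes -->
      "          <!-- closing quote -->
    </rule>
    <rule name="integer">
      [0-9]+     <!-- one or more digits -->
    </rule>
  </xsl:variable>

  <!-- Hypothetical sample input -->
  <xsl:param name="input" as="xs:string"
             select="'id 42 is &quot;answer&quot;'"/>

  <xsl:template name="main">
    <xsl:for-each select="$rules/rule">
      <!-- Keep hold of the rule: inside xsl:analyze-string the
           context item becomes the matched substring -->
      <xsl:variable name="rule" select="."/>
      <xsl:analyze-string select="$input" regex="{string($rule)}" flags="x">
        <xsl:matching-substring>
          <!-- Report every match together with the rule that found it -->
          <xsl:value-of select="concat($rule/@name, ': ', ., '&#10;')"/>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Because every rule is applied with the 'x' flag, each regex in the library
can be laid out and commented in the same way as the variable above.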

Cheers,
-- Abel Braaksma

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--



