RE: [xsl] lookaheads in XSLT2 regexes
2010-03-04 11:39:37
I feel that \b is very much tied to a specific set of characters which might
not be exactly the set you want. I'd be more comfortable providing
general-purpose zero-width look-ahead and look-behind:
regex="(?<=\P{{L}})\p{{Lu}}{{2,}}(?=\P{{L}})"
which seems far more powerful.
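A rough sketch of how such look-arounds behave, written in Python because XSLT 2.0 regexes do not accept them. Python's re module has no \p{...} classes, so [A-Z] and [A-Za-z] stand in for \p{Lu} and \p{L} here (an assumption that limits the sketch to ASCII). The names POSITIVE, NEGATIVE and find_acronyms are made up for illustration. One thing it brings out: the positive form needs a non-letter character on each side, so an acronym at the very start or end of the string is skipped, while a negative look-around variant also covers the edges.

```python
import re

# ASCII stand-ins (an assumption): [A-Z] for \p{Lu}, [A-Za-z] for \p{L}.
# Positive look-arounds: a non-letter must exist on both sides.
POSITIVE = re.compile(r"(?<=[^A-Za-z])[A-Z]{2,}(?=[^A-Za-z])")
# Negative look-arounds: no letter may precede or follow, so string edges qualify.
NEGATIVE = re.compile(r"(?<![A-Za-z])[A-Z]{2,}(?![A-Za-z])")


def find_acronyms(pattern, text):
    """Return all non-overlapping acronym matches of pattern in text."""
    return pattern.findall(text)


sample = "NATO met the EU, then UNESCO"
print(find_acronyms(POSITIVE, sample))  # only the acronym with neighbours on both sides
print(find_acronyms(NEGATIVE, sample))  # also the ones at the string edges
```

Whether the positive or the negative form is wanted depends on how the start and end of a text node should be treated.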
Regards,
Michael Kay
http://www.saxonica.com/
http://twitter.com/michaelhkay
-----Original Message-----
From: Imsieke, Gerrit, le-tex
[mailto:gerrit(_dot_)imsieke(_at_)le-tex(_dot_)de]
Sent: 04 March 2010 17:12
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] lookaheads in XSLT2 regexes
Dear Liam,
Thanks for promoting the \b case. As an illustration of \b's
usefulness, let me show how I tag acronyms in a recent project:
<xsl:template match="text()" mode="majuscules">
  <xsl:analyze-string select="."
      regex="(^|[\p{{P}}\p{{Z}}\p{{C}}])(\p{{Lu}}{{2,}})([\p{{P}}\p{{Z}}\p{{C}}]|$)">
    <xsl:matching-substring>
      <xsl:value-of select="regex-group(1)"/>
      <span class="majusc">
        <xsl:value-of select="regex-group(2)"/>
      </span>
      <xsl:value-of select="regex-group(3)"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
With (a reasonably defined) \b, this could be simplified to
<xsl:template match="text()" mode="majuscules">
  <xsl:analyze-string select="." regex="\b\p{{Lu}}{{2,}}\b">
    <xsl:matching-substring>
      <span class="majusc">
        <xsl:value-of select="."/>
      </span>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
Please note that \b should not only match the \w/\W boundary,
but also the beginning or end of the string (or line, when
the 'm' flag is in force). Speaking of the 'm' flag, and in
Michael's direction: I regard \b as much more useful than the
'm' flag when processing XML.
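The difference between the two approaches can be sketched in Python, whose \b already matches at the start and end of the string. This is an approximation only: [A-Z] stands in for \p{Lu}, a small hand-picked delimiter class stands in for [\p{P}\p{Z}\p{C}], and Python's \b is based on Python's own \w rather than the XSD one. The function names are hypothetical. The sketch also shows a side effect of consuming the delimiters: when two acronyms are separated by a single character, that character is used up by the first match and the second acronym goes untagged.

```python
import re

# Assumptions: [A-Z] approximates \p{Lu}; the delimiter class below is a
# small ASCII approximation of [\p{P}\p{Z}\p{C}].
CONSUMING = re.compile(r"(^|[\s.,;:!?()-])([A-Z]{2,})([\s.,;:!?()-]|$)")
BOUNDARY = re.compile(r"\b[A-Z]{2,}\b")


def tag_consuming(text):
    """Tag acronyms with a pattern that consumes the surrounding delimiters."""
    return CONSUMING.sub(r'\1<span class="majusc">\2</span>\3', text)


def tag_boundary(text):
    """Tag acronyms using zero-width word boundaries."""
    return BOUNDARY.sub(
        lambda m: '<span class="majusc">' + m.group(0) + "</span>", text
    )


print(tag_consuming("EU NATO"))  # the space is consumed, so NATO is missed
print(tag_boundary("EU NATO"))   # both acronyms are tagged
```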
Gerrit
On 04.03.2010 06:59, Liam R E Quin wrote:
On Wed, 2010-03-03 at 21:27 +0000, Michael Kay wrote:
On the subject of \b I'll note we do have \W and \w
So we do, I overlooked that. And we define it a little differently
from Perl:
[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]
So for example "+" is regarded as part of a word, while "-" isn't.
Which strikes me as totally useless, to be honest.
I agree.
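To see what that definition does in practice, here is a small Python sketch that classifies characters by Unicode general category. The function name xsd_is_word_char is made up for illustration; the rule it encodes is the one quoted above, i.e. every character except those in the P (punctuation), Z (separator) and C (other) categories.

```python
import unicodedata


def xsd_is_word_char(ch):
    # \w as quoted above: any character outside the Unicode general
    # categories P (punctuation), Z (separator) and C (other).
    return unicodedata.category(ch)[0] not in "PZC"


for ch in "+-_a$ ":
    print(repr(ch), unicodedata.category(ch), xsd_is_word_char(ch))
```

"+" has category Sm (math symbol) and so counts as a word character, while "-" (Pd) and "_" (Pc) are punctuation and do not.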
We could fix that for XPath 2.1 I think. I'm not sure what the most
useful fix would be, I admit.
The Perl definition of "alphanumeric" plus "_" would probably work for
\w, if one took alphanumeric to mean Letters|Numbers, \p{L}|\p{N}, and
is coincidentally closer to what you get in Perl if you do
use locale;
and your locale is (say) en_GB.UTF-8, as it's then the same as the
POSIX fragment [[:alpha:][:digit:]_]
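A minimal Python sketch of that suggested definition, for comparison. The function name proposed_is_word_char is hypothetical, and reading "Letters|Numbers" as the full Unicode L and N categories is an assumption (\p{N} is somewhat wider than POSIX [:digit:]).

```python
import unicodedata


def proposed_is_word_char(ch):
    # Suggested \w: a letter (\p{L}), a number (\p{N}), or the underscore.
    return ch == "_" or unicodedata.category(ch)[0] in "LN"


for ch in "+-_a7":
    print(repr(ch), unicodedata.category(ch), proposed_is_word_char(ch))
```

Under this reading neither "+" nor "-" is a word character, and "_" is.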
There are lots of things that could be added to regular expressions;
but \b is hard to emulate, useful, and also we seem to have a rather
odd \w. If \w is there, I think \b was omitted by mistake. Or that
\w was included by mistake!
Liam
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail:
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--