Re: [xsl] lookaheads in XSLT2 regexes
2010-03-04 15:30:45
On 04.03.2010 18:39, Michael Kay wrote:
I feel that \b is very much tied to a specific set of characters which might
not be exactly the set you want. I'd be more comfortable providing
general-purpose zero-width look-ahead and look-behind:
If no canonical definition of \w seems feasible, and definitions that
depend on either the locale or a user's configuration file yield unexpected
results for other users, one could resort to a \w that is defined on a
per-stylesheet basis. As I suggested in an earlier posting, one could use
a stylesheet attribute with a (limited) regex syntax, e.g.:
<xsl:stylesheet ... word-constituents="[\p{Ll}\p{Lu}‑]">
When compiling the stylesheet, a preprocessor would statically expand
\w, \W, and \b. Of course the word constituents must be thoroughly
checked against the limited syntax prior to expansion, in order to
ensure that otherwise valid regexes remain valid.
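To make the static expansion concrete, here is a minimal sketch in Python (standing in for whatever language a stylesheet compiler is written in). The function name expand_word_escapes, its constituents parameter and the ASCII default are assumptions for illustration only, not part of any XSLT processor, and the target regex dialect is assumed to support lookarounds. The sketch ignores the harder case of \w occurring inside a character class, which is one reason the constituent syntax would have to be restricted.

```python
import re


def expand_word_escapes(pattern, constituents="A-Za-z"):
    """Statically expand \\w, \\W and \\b for a given word-constituent set.

    `constituents` is the inside of a character class; it is a hypothetical
    stand-in for the value of the proposed word-constituents attribute.
    """
    word = "[" + constituents + "]"
    non_word = "[^" + constituents + "]"
    # A boundary is a position with a word character on exactly one side.
    boundary = (
        "(?:(?<=" + word + ")(?!" + word + ")"
        "|(?<!" + word + ")(?=" + word + "))"
    )
    table = {"w": word, "W": non_word, "b": boundary}

    def replace(match):
        # Other escapes, including an escaped backslash, pass through unchanged.
        return table.get(match.group(1), match.group(0))

    # Consume escapes pairwise so that "\\b" (literal backslash + b) is not expanded.
    return re.sub(r"\\(.)", replace, pattern)


if __name__ == "__main__":
    expanded = expand_word_escapes(r"\b[A-Z]{2,}\b")
    print(re.findall(expanded, "NASA and Esa met UNESCO"))
```

Because the expansion happens before the regex is compiled, two stylesheets with different constituent sets would each get predictable behaviour, independent of locale or user configuration.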
regex="(?<=\P{{L}})\p{{Lu}}{{2,}}(?=\P{{L}})"
I tried this in Perl: the lookbehind didn't match at ^ (beginning of
line/string), while the lookahead did match at $.
Maybe this is different with Java. But if this aspect of lookbehind
behaviour turns out to be implementation-dependent, the predictability
constraint is violated.
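The edge behaviour can be checked in isolation. The following is a small sketch in Python, whose re module lacks \p{...}, so the ASCII classes [A-Za-z] and [A-Z] are used as stand-ins for \p{L} and \p{Lu}; the names positive, negative and acronyms are made up for this illustration. In this engine a positive lookaround needs an actual character to inspect, so it fails at both ends of the string; an asymmetry like the one reported above may simply come from a trailing newline in the test input, which satisfies the lookahead.

```python
import re

# ASCII stand-ins for \P{L} and \p{Lu}, which Python's re module does not support.
NON_LETTER = r"[^A-Za-z]"
UPPER = r"[A-Z]"

# Positive lookarounds: a non-letter character must actually be present.
positive = re.compile(
    r"(?<=" + NON_LETTER + r")" + UPPER + r"{2,}(?=" + NON_LETTER + r")"
)

# Negative lookarounds: succeed whenever no letter is adjacent,
# which includes the very beginning and end of the string.
negative = re.compile(r"(?<![A-Za-z])" + UPPER + r"{2,}(?![A-Za-z])")


def acronyms(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(acronyms(positive, "NASA and ESA"))
    print(acronyms(positive, "NASA and ESA\n"))
    print(acronyms(negative, "NASA and ESA"))
```

Writing the assertions in their negative form, (?<!\p{L}) and (?!\p{L}), sidesteps the question of what happens at ^ and $, at least in engines that support negative lookarounds.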
In addition, as Liam pointed out, the '<' character in the regex
attribute might irritate the XML parser.
And I think for commonplace situations such as word boundaries (whatever
definition of 'word' you might choose), a crisp single-char escape as \b
should be available (in addition to the powerful and flexible lookahead
and lookbehind assertions).
This reminds me of the classic mod_rewrite motto:
``The great thing about mod_rewrite is it gives you all the
configurability and flexibility of Sendmail. The downside to mod_rewrite
is that it gives you all the configurability and flexibility of Sendmail.''
Or, to cite another piece of CS folklore: "Make the easy things easy and the
hard things possible."
Of course if you doubt that the concept of a word boundary or a word
constituent is an easy (in the sense of commonplace) one, the users will
have to resort to the flexible lookahead mechanisms (once they are
available in XSLT 2.1).
A compromise would be (as suggested above):
- allow the concise \b and \w syntax in regexes,
- provide a per-stylesheet means of redefining the default word-constituent
  expression.
Gerrit
which seems far more powerful.
Regards,
Michael Kay
http://www.saxonica.com/
http://twitter.com/michaelhkay
-----Original Message-----
From: Imsieke, Gerrit, le-tex [mailto:gerrit(_dot_)imsieke(_at_)le-tex(_dot_)de]
Sent: 04 March 2010 17:12
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] lookaheads in XSLT2 regexes
Dear Liam,
Thanks for promoting the \b case. As an illustration for \b's
usefulness, let me show how I tag acronyms for a recent project:
<xsl:template match="text()" mode="majuscules">
  <xsl:analyze-string select="."
    regex="(^|[\p{{P}}\p{{Z}}\p{{C}}])(\p{{Lu}}{{2,}})([\p{{P}}\p{{Z}}\p{{C}}]|$)">
    <xsl:matching-substring>
      <xsl:value-of select="regex-group(1)"/>
      <span class="majusc">
        <xsl:value-of select="regex-group(2)"/>
      </span>
      <xsl:value-of select="regex-group(3)"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
With (a reasonably defined) \b, this could be simplified to
<xsl:template match="text()" mode="majuscules">
  <xsl:analyze-string select="." regex="\b\p{{Lu}}{{2,}}\b">
    <xsl:matching-substring>
      <span class="majusc">
        <xsl:value-of select="."/>
      </span>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
Please note that \b should not only match the \w/\W boundary,
but also the beginning or end of the string (or line, when
the 'm' flag is in force). Speaking of the 'm' flag, and in
Michael's direction: I regard \b as much more useful than the
'm' flag when processing XML.
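A further difference between the two variants is that the group-based workaround consumes the separator characters, whereas \b is zero-width. The sketch below illustrates this in Python; SEP is an ASCII stand-in for [\p{P}\p{Z}\p{C}], [A-Z] stands in for \p{Lu}, and the function names are invented for this example. Python's own \b (based on letters, digits and underscore) is used as one possible "reasonably defined" boundary. The same effect would be expected in any engine that returns non-overlapping matches.

```python
import re

# ASCII stand-in for the separator class [\p{P}\p{Z}\p{C}].
SEP = r"[\s.,;:!?()\-]"

# Workaround: the separators on both sides are part of the match.
consuming = re.compile(r"(^|" + SEP + r")([A-Z]{2,})(" + SEP + r"|$)")

# Zero-width variant: the boundaries consume no characters.
zero_width = re.compile(r"\b[A-Z]{2,}\b")


def tag_consuming(text):
    """Wrap acronyms, re-emitting the consumed separators around them."""
    return consuming.sub(r'\1<span class="majusc">\2</span>\3', text)


def tag_zero_width(text):
    """Wrap acronyms found between zero-width word boundaries."""
    return zero_width.sub(
        lambda m: '<span class="majusc">' + m.group(0) + "</span>", text
    )


if __name__ == "__main__":
    print(tag_consuming("NASA ESA"))
    print(tag_zero_width("NASA ESA"))
```

With the consuming pattern, the space after the first acronym is already part of the first match, so a second acronym that follows immediately has no separator left in front of it and goes untagged. The zero-width pattern tags both.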
Gerrit
On 04.03.2010 06:59, Liam R E Quin wrote:
On Wed, 2010-03-03 at 21:27 +0000, Michael Kay wrote:
On the subject of \b I'll note we do have \W and \w
So we do, I overlooked that. And we define it a little differently
from Perl:
[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]
So for example "+" is regarded as part of a word, while "-" isn't.
Which strikes me as totally useless, to be honest.
I agree.
We could fix that for XPath 2.1 I think. I'm not sure what the most
useful fix would be, I admit.
The Perl definition of "alphanumeric" plus "_" would probably work for
\w, if one took alphanumeric to mean Letters|Numbers, \p{L}|\p{N}, and
is coincidentally closer to what you get in Perl if you do
use locale;
and your locale is (say) en_UK.UTF8, as it's then the same as the
POSIX fragment [[:alpha:][:digit:]_]
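The two definitions of a word character discussed above can be compared directly using Unicode general categories. This is a small sketch in Python; the function names xsd_word_char and perl_like_word_char are made up for the comparison, and the second function implements only the \p{L}|\p{N} plus "_" proposal from this message, not Perl's full \w.

```python
import unicodedata


def xsd_word_char(ch):
    """Current \\w: any character not in Punctuation (P), Separator (Z) or Other (C)."""
    return unicodedata.category(ch)[0] not in "PZC"


def perl_like_word_char(ch):
    """Proposed \\w: Letters (L), Numbers (N), plus the underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"


if __name__ == "__main__":
    for ch in "+-_a7$":
        print(repr(ch), unicodedata.category(ch),
              xsd_word_char(ch), perl_like_word_char(ch))
```

Under the current definition "+" (category Sm) counts as a word character while "-" (Pd) and "_" (Pc) do not; under the proposed one "+" drops out and "_" is added explicitly.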
There are lots of things that could be added to regular expressions;
but \b is hard to emulate, useful, and also we seem to have a rather
odd \w. If \w is there, I think \b was omitted by mistake. Or that
\w was included by mistake!
Liam
--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit(_dot_)imsieke(_at_)le-tex(_dot_)de, http://www.le-tex.de
Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930
Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
Thomas Schmidt, Dr. Reinhard Vöckler
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail:<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--