Re: [xsl] lookaheads in XSLT2 regexes



On 04.03.2010 18:39, Michael Kay wrote:

I feel that \b is very much tied to a specific set of characters which might
not be exactly the set you want. I'd be more comfortable providing
general-purpose zero-width look-ahead and look-behind:

If no canonical definition of \w seems feasible and definitions that
depend on either locale or a user's configuration file yield unexpected
results for other users -- maybe resort to a \w that may be defined on a
per-stylesheet basis. As I suggested in a former posting, one could use
a stylesheet attribute with a (limited) regex syntax, e.g.:

<xsl:stylesheet ... word-constituents="[\p{Ll}\p{Lu}&#x2011;]">

When compiling the stylesheet, a preprocessor would statically expand
\w, \W, and \b. Of course the word constituents must be thoroughly
checked against the limited syntax prior to expansion, in order to
ensure that otherwise valid regexes remain valid.
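The static expansion described above could look roughly like the following sketch. It is written in Python rather than XSLT, because the expansion of \b shown here relies on lookaround assertions, which XSLT 2.0 regexes do not have. The function name expand_word_escapes and the way \b is rewritten are assumptions made for illustration, not an existing processor feature. The sketch only does a basic shape check on the word-constituent expression and does not handle \w inside a character class.

```python
import re


def expand_word_escapes(regex: str, constituents: str) -> str:
    """Expand \\w, \\W and \\b using a bracket expression such as '[A-Za-z]'.

    Hypothetical helper: assumes a target regex dialect with lookarounds.
    """
    if (not constituents.startswith("[")
            or not constituents.endswith("]")
            or constituents.startswith("[^")):
        raise ValueError("expected a positive bracket expression like [...]")

    inner = constituents[1:-1]
    word = "[" + inner + "]"
    non_word = "[^" + inner + "]"
    # A boundary is a position with a word character on exactly one side.
    boundary = f"(?:(?<={word})(?!{word})|(?<!{word})(?={word}))"
    table = {"w": word, "W": non_word, "b": boundary}

    def replace(match: re.Match) -> str:
        # Unknown escapes (including an escaped backslash) are kept as-is.
        return table.get(match.group(1), match.group(0))

    # Consume escapes pairwise so that '\\b' (escaped backslash + 'b') survives.
    return re.sub(r"\\(.)", replace, regex, flags=re.DOTALL)


# U+2011 (non-breaking hyphen) counts as a word constituent here,
# as in the word-constituents attribute example.
custom = expand_word_escapes(r"\b[A-Z]{2,}\b", "[A-Za-z\u2011]")
sample = "AB\u2011CD XSLT"
custom_hits = re.findall(custom, sample)
default_hits = re.findall(r"\b[A-Z]{2,}\b", sample)
```

With the hyphen declared as a word constituent, "AB" and "CD" no longer sit at word boundaries, so only "XSLT" is found; Python's built-in \b finds all three.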


regex="(?<=\P{L})\p{{Lu}}{{2,}}(?=\P{L})"

Tried this in Perl; the lookbehind didn't match ^ (beginning of
line/string), while the lookahead matched $. Maybe this is different
with Java. But if this aspect of lookbehind behaviour turns out to be
implementation-dependent, the predictability constraint is violated.

In addition, as Liam pointed out, the '<' character in the regex
attribute might irritate the XML parser.

And I think for commonplace situations such as word boundaries (whatever
definition of 'word' you might choose), a crisp single-char escape such
as \b should be available (in addition to the powerful and flexible
lookahead and lookbehind assertions).
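The lookbehind observation can be reproduced with a small sketch in Python's re module, used here only as a stand-in since XSLT 2.0 regexes have no lookarounds. Python's re has no \p{...} classes, so the ASCII classes [A-Z] and [A-Za-z] are assumed as substitutes for \p{Lu} and \p{L}. A positive lookbehind for a non-letter needs an actual character to look at, so it fails at the start of the string; the negated form (?<!...) does not.

```python
import re

# Positive lookarounds: require a non-letter character on both sides.
POSITIVE = re.compile(r"(?<=[^A-Za-z])[A-Z]{2,}(?=[^A-Za-z])")
# Negative lookarounds: only require that no letter is adjacent,
# which is also true at the very start and end of the string.
NEGATIVE = re.compile(r"(?<![A-Za-z])[A-Z]{2,}(?![A-Za-z])")

text = "XML and XSLT, then SAXON"

positive_hits = POSITIVE.findall(text)   # misses the first and last acronym
negative_hits = NEGATIVE.findall(text)   # finds all three
```

In this engine at least, the negated lookarounds give the "boundary or edge of string" behaviour without a special case for ^ and $.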


This reminds me of the classic mod_rewrite motto:

``The great thing about mod_rewrite is it gives you all the
configurability and flexibility of Sendmail. The downside to mod_rewrite
is that it gives you all the configurability and flexibility of Sendmail.''

Or, to cite another piece of CS folklore: "Make the easy things easy and
the hard things possible."

Of course if you doubt that the concept of a word boundary or a word
constituent is an easy (in the sense of commonplace) one, the users will
have to resort to the flexible lookahead mechanisms (once they are
available in XSLT 2.1).


A compromise could be (as suggested above):
- allow concise \b and \w syntax in the regexes,
- provide a per-stylesheet means to redefine the default word
  constituent expression.

Gerrit


which seems far more powerful.

Regards,

Michael Kay
http://www.saxonica.com/
http://twitter.com/michaelhkay

-----Original Message-----
From: Imsieke, Gerrit, le-tex [mailto:gerrit(_dot_)imsieke(_at_)le-tex(_dot_)de]
Sent: 04 March 2010 17:12
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] lookaheads in XSLT2 regexes

Dear Liam,

Thanks for promoting the \b case. As an illustration for \b's
usefulness, let me show how I tag acronyms for a recent project:

    <xsl:template match="text()" mode="majuscules">
      <xsl:analyze-string select="."
                          regex="(^|[\p{{P}}\p{{Z}}\p{{C}}])(\p{{Lu}}{{2,}})([\p{{P}}\p{{Z}}\p{{C}}]|$)">
        <xsl:matching-substring>
          <xsl:value-of select="regex-group(1)"/>
          <span class="majusc">
            <xsl:value-of select="regex-group(2)"/>
          </span>
          <xsl:value-of select="regex-group(3)"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:template>

With (a reasonably defined) \b, this could be simplified to

    <xsl:template match="text()" mode="majuscules">
      <xsl:analyze-string select="." regex="\b\p{{Lu}}{{2,}}\b">
        <xsl:matching-substring>
          <span class="majusc">
            <xsl:value-of select="."/>
          </span>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:template>

Please note that \b should not only match the \w/\W boundary,
but also the beginning or end of the string (or line, when
the 'm' flag is in force). Speaking of the 'm' flag, and in
Michael's direction: I regard \b as much more useful than the
'm' flag when processing XML.
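One practical difference between the two templates can be sketched in Python's re module (a stand-in for the XSLT regex engine; the ASCII classes and the delimiter set [\s.,;] are simplifying assumptions in place of \p{Lu} and [\p{P}\p{Z}\p{C}]). The first pattern consumes the delimiter after an acronym, so that character is no longer available as the leading delimiter of the next match. A zero-width \b consumes nothing.

```python
import re

# Delimiters are captured and therefore consumed by each match.
consuming = re.compile(r"(^|[\s.,;])([A-Z]{2,})([\s.,;]|$)")
# Zero-width boundaries leave the delimiters untouched.
zero_width = re.compile(r"\b[A-Z]{2,}\b")

text = "XML XSLT"

tagged_consuming = consuming.sub(r"\1<span>\2</span>\3", text)
tagged_zero_width = zero_width.sub(
    lambda m: "<span>" + m.group(0) + "</span>", text
)
```

Here tagged_consuming wraps only "XML", because the space after it was used up by the first match, while tagged_zero_width wraps both acronyms.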

Gerrit



On 04.03.2010 06:59, Liam R E Quin wrote:

On Wed, 2010-03-03 at 21:27 +0000, Michael Kay wrote:

On the subject of \b I'll note we do have \W and \w


So we do, I overlooked that. And we define it a little differently
from Perl:

[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]

So for example "+" is regarded as part of a word, while "-" isn't.
Which strikes me as totally useless, to be honest.
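The "+" versus "-" remark can be checked against the Unicode general categories. The sketch below uses Python's unicodedata module; the function name is_xsd_word_char is made up for this example and simply restates the definition quoted above (everything except categories P, Z and C).

```python
import unicodedata


def is_xsd_word_char(ch: str) -> bool:
    """True if ch lies outside the P, Z and C general categories."""
    return unicodedata.category(ch)[0] not in "PZC"


plus_is_word = is_xsd_word_char("+")         # "+" is Sm (math symbol)
hyphen_is_word = is_xsd_word_char("-")       # "-" is Pd (dash punctuation)
underscore_is_word = is_xsd_word_char("_")   # "_" is Pc (connector punctuation)
```

So under this definition "+" counts as a word character, while "-" and even "_" do not.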


I agree.

We could fix that for XPath 2.1 I think.  I'm not sure what the most
useful fix would be, I admit.

The Perl definition of "alphanumeric" plus "_" would probably work for
\w, if one took alphanumeric to mean Letters|Numbers, \p{L}|\p{N}, and
is coincidentally closer to what you get in Perl if you do
      use locale;
and your locale is (say) en_UK.UTF8, as it's then the same as the
POSIX fragment [[:alpha:][:digit:]_]
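For comparison, the \p{L}|\p{N} plus "_" reading suggested here can be sketched the same way in Python. The function name is_proposed_word_char is hypothetical, and the sketch only tests general categories; it does not model locale-dependent Perl behaviour.

```python
import unicodedata


def is_proposed_word_char(ch: str) -> bool:
    """True for letters (L*), numbers (N*) and the underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"


underscore_ok = is_proposed_word_char("_")       # explicitly included
plus_ok = is_proposed_word_char("+")             # Sm, excluded
hyphen_ok = is_proposed_word_char("-")           # Pd, excluded
accented_ok = is_proposed_word_char("\u00e9")    # e with acute accent, Ll
digit_ok = is_proposed_word_char("\u0663")       # Arabic-Indic digit three, Nd
```

Under this reading "+" and "-" are treated alike, and "_" becomes a word character.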

There are lots of things that could be added to regular expressions;
but \b is hard to emulate, useful, and also we seem to have a rather
odd \w.  If \w is there, I think \b was omitted by mistake.

Or that \w was included by mistake!

Liam


--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit(_dot_)imsieke(_at_)le-tex(_dot_)de, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Geschäftsführer: Gerrit Imsieke, Svea Jelonek, Thomas
Schmidt, Dr. Reinhard Vöckler

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail:<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--





