Re: [xsl] lookaheads in XSLT2 regexes

2010-03-04 11:12:56
Dear Liam,

Thanks for making the case for \b. To illustrate how useful \b would be, here is how I tag acronyms in a recent project:

  <xsl:template match="text()" mode="majuscules">
    <xsl:analyze-string select="." regex="(^|[\p{{P}}\p{{Z}}\p{{C}}])(\p{{Lu}}{{2,}})([\p{{P}}\p{{Z}}\p{{C}}]|$)">
      <xsl:matching-substring>
        <xsl:value-of select="regex-group(1)"/>
        <span class="majusc">
          <xsl:value-of select="regex-group(2)"/>
        </span>
        <xsl:value-of select="regex-group(3)"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

With a reasonably defined \b, this could be simplified to:

  <xsl:template match="text()" mode="majuscules">
    <xsl:analyze-string select="." regex="\b\p{{Lu}}{{2,}}\b">
      <xsl:matching-substring>
        <span class="majusc">
          <xsl:value-of select="."/>
        </span>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

Note that \b should match not only at a \w/\W boundary, but also at the beginning or end of the string (or of a line, when the 'm' flag is in force). Speaking of the 'm' flag, and addressed to Michael: I regard \b as much more useful than the 'm' flag when processing XML.
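
Until \b is available, one possible workaround avoids the two delimiter groups altogether: match every run of \w characters and test each run separately. The following is a minimal sketch, assuming an XSLT 2.0 processor; the root template and the <p> wrapper are only a hypothetical driver for trying it out. Because no delimiter is consumed by a match, it should also tag both acronyms in a string like "NASA ESA", where the first template above uses up the space after "NASA" as regex-group(3) and then finds no leading delimiter for "ESA".

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical driver: push every text node through the mode. -->
  <xsl:template match="/">
    <p>
      <xsl:apply-templates select="//text()" mode="majuscules"/>
    </p>
  </xsl:template>

  <xsl:template match="text()" mode="majuscules">
    <!-- \w+ matches maximal runs of characters that are not in
         \p{P}, \p{Z} or \p{C}, i.e. the stretches between delimiters. -->
    <xsl:analyze-string select="." regex="\w+">
      <xsl:matching-substring>
        <xsl:choose>
          <!-- Tag the run only if it consists entirely of two or more
               uppercase letters. No AVT here, so braces are not doubled. -->
          <xsl:when test="matches(., '^\p{Lu}{2,}$')">
            <span class="majusc">
              <xsl:value-of select="."/>
            </span>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="."/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:matching-substring>
      <!-- Delimiters (punctuation, separators, control characters)
           are copied through unchanged. -->
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Since \w is defined as the complement of [\p{P}\p{Z}\p{C}], the runs are delimited by the same characters as in the first template, and the start and end of the text node act as boundaries without special treatment.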

Gerrit



On 04.03.2010 06:59, Liam R E Quin wrote:
> On Wed, 2010-03-03 at 21:27 +0000, Michael Kay wrote:
>>> On the subject of \b I'll note we do have \W and \w
>>
>> So we do, I overlooked that. And we define it a little differently from
>> Perl:
>>
>> [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]
>>
>> So for example "+" is regarded as part of a word, while "-" isn't. Which
>> strikes me as totally useless, to be honest.
>
> I agree.
>
> We could fix that for XPath 2.1 I think.  I'm not sure what the most
> useful fix would be, I admit.
>
> The Perl definition of "alphanumeric" plus "_" would probably work for
> \w, if one took alphanumeric to mean Letters|Numbers, \p{L}|\p{N},
> and is coincidentally closer to what you get in Perl if you do
>      use locale;
> and your locale is (say) en_UK.UTF8, as it's then the same as
> the POSIX fragment [[:alpha:][:digit:]_]
>
> There are lots of things that could be added to regular expressions;
> but \b is hard to emulate, useful, and also we seem to have a rather
> odd \w.  If \w is there, I think \b was omitted by mistake.  Or that
> \w was included by mistake!
>
> Liam
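
The \w definition quoted above can be checked directly against the alternative Liam suggests. This is a minimal sketch, assuming an XSLT 2.0 processor; the sample characters and the named template "main" are hypothetical choices for illustration, and the class [\p{L}\p{N}_] is only one reading of his proposal, not something any specification defines as \w.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Run against any input document, or start at the named template. -->
  <xsl:template match="/" name="main">
    <!-- Sample characters: a symbol, a dash, the underscore,
         a letter and a digit. -->
    <xsl:for-each select="('+', '-', '_', 'é', '7')">
      <xsl:value-of select="."/>
      <xsl:text>  current \w: </xsl:text>
      <!-- \w as defined today: anything outside \p{P}, \p{Z}, \p{C}. -->
      <xsl:value-of select="matches(., '^\w$')"/>
      <xsl:text>  letters, numbers, underscore: </xsl:text>
      <!-- The suggested alternative, written out as a character class. -->
      <xsl:value-of select="matches(., '^[\p{L}\p{N}_]$')"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

"+" belongs to category Sm (math symbol), so the current \w accepts it, while "-" (Pd) and "_" (Pc) are punctuation and are rejected. The alternative class would reject "+" and accept "_".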


--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit(_dot_)imsieke(_at_)le-tex(_dot_)de, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
Thomas Schmidt, Dr. Reinhard Vöckler
