RE: [XSLT2.0] xsl:analyze-string@regex syntax too limited

I think this post illustrates why these questions always involve more WG
debate than one initially expects, and why the WG is now taking a fairly
strict line on "no new functionality". (The Perl definition of \b,
incidentally, is quite different from that quoted.)

The other suggestion Gunther made was to relax the rules on vendor
extensions to the regex syntax. There is in fact a proposal on the table
from one of the XQuery vendors to do that. Traditionally the XSL WG has
taken a pretty tough line on vendor extensions, the principle being that it
must be possible for a processor to detect that extensions are in use, and
it must be possible for a user to write fallback code that keeps the
stylesheet portable. This policy can be traced back to the original
expectation that XSLT would usually run in the browser, and that the
stylesheet author would have no control over which browser it would run in.
But I think the policy has served the community well.

Michael Kay
http://www.saxonica.com/

-----Original Message-----
From: Colin Paul Adams [mailto:colin(_at_)colina(_dot_)demon(_dot_)co(_dot_)uk]
Sent: 16 December 2004 07:25
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] [XSLT2.0] xsl:analyze-string@regex syntax too limited

"Gunther" == Gunther Schadow <gunther(_at_)aurora(_dot_)regenstrief(_dot_)org> writes:

    Gunther> The boundary matcher matches a zero-width substring
    Gunther> between a character matching the character class
    Gunther> [A-Za-z_0-9] and a character matching the character class
    Gunther> [^A-Za-z_0-9] or vice versa.  </quote>

    Gunther> This is pretty clear. It may not make the
    Gunther> internationalization people very happy because I can't do
    Gunther> word-boundary matches on Hindi text. That's a true
    Gunther> concern.
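
To make the limitation concrete, here is a minimal sketch of the quoted
definition, taken literally: a boundary is any position between a character
in [A-Za-z_0-9] and one outside it. The function name and the sample strings
are illustrative only, and Python is used purely as a neutral notation; this
is not how any XSLT processor implements it.

```python
import re

# The class from the quoted definition; everything else is a non-word character.
WORD = re.compile(r"[A-Za-z_0-9]")

def ascii_boundaries(text):
    """Offsets i where text[i-1] and text[i] fall on opposite sides of the class."""
    return [
        i
        for i in range(1, len(text))
        if bool(WORD.fullmatch(text[i - 1])) != bool(WORD.fullmatch(text[i]))
    ]

# Two Hindi words separated by a space, written as escapes to avoid encoding trouble.
HINDI = "\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e"

print(ascii_boundaries("ab cd"))  # [2, 3]
print(ascii_boundaries(HINDI))    # []
```

Every Devanagari character falls into the negated class, so no boundary is
ever reported in the Hindi string, not even around the space.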

So address it. Unicode report TR18 says (for Level 1 support):

RL1.4 Simple Word Boundaries
      To meet this requirement, an implementation shall extend the word
      boundary mechanism so that:

   1. The class of <word_character> includes all the Alphabetic values
      from the Unicode character database, from UnicodeData.txt [UData].
      See also Annex C: Compatibility Properties.

   2. Non-spacing marks are never divided from their base characters,
      and otherwise ignored in locating boundaries.

Level 2 provides more general support for word boundaries between
arbitrary Unicode characters which may override this behavior.

Level 1 support should certainly be met.
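
A rough sketch of what the two RL1.4 rules amount to, under stated
assumptions: Python's unicodedata module does not expose the Alphabetic
property, so general categories L* and Nl stand in for it (plus Nd and "_"
to keep the digits and underscore of the ASCII class), and every combining
mark (category M*) is treated as attached to its base, which is slightly
broader than the non-spacing marks the rule names. The function names are
made up for the illustration.

```python
import unicodedata

def is_word_char(ch):
    """Approximate <word_character>: letters, letter numbers, digits, underscore."""
    cat = unicodedata.category(ch)
    return ch == "_" or cat.startswith("L") or cat in ("Nl", "Nd")

def simple_boundaries(text):
    """Offsets where the word/non-word class changes, never splitting a mark from its base."""
    classes = []
    for ch in text:
        if unicodedata.category(ch).startswith("M") and classes:
            # Rule 2: a mark takes the class of the character it follows.
            classes.append(classes[-1])
        else:
            # Rule 1: classify by the (approximated) Alphabetic property.
            classes.append(is_word_char(ch))
    return [i for i in range(1, len(text)) if classes[i - 1] != classes[i]]

# Two Hindi words separated by a space, written as escapes to avoid encoding trouble.
HINDI = "\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e"

print(simple_boundaries(HINDI))    # [6, 7]
print(simple_boundaries("ab cd"))  # [2, 3]
```

With these rules the only boundaries in the Hindi string are on either side
of the space, and the vowel signs and virama stay with their consonants.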
-- 
Colin Paul Adams
Preston Lancashire

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--