xsl-list

Re: [XSLT2.0] xsl:analyze-string@regex syntax too limited

2004-12-16 16:14:56
Thanks, good find. The only problem now is that this issue needs to be
addressed in java.util.regex.

Colin Paul Adams wrote:

"Gunther" == Gunther Schadow <gunther(_at_)aurora(_dot_)regenstrief(_dot_)org> writes:


    Gunther> The boundary matcher matches a zero-width substring
    Gunther> between a character matching the character class
    Gunther> [A-Za-z_0-9] and a character matching the character class
    Gunther> [^A-Za-z_0-9] or vice versa.  </quote>

    Gunther> This is pretty clear. It may not make the
    Gunther> internationalization people very happy because I can't do
    Gunther> word-boundary matches on Hindi text. That's a true
    Gunther> concern.

So address it. Unicode report TR18 says (for Level 1 support):

RL1.4 Simple Word Boundaries
      To meet this requirement, an implementation shall extend the word
      boundary mechanism so that:

   1. The class of <word_character> includes all the Alphabetic values
      from the Unicode character database, from UnicodeData.txt [UData].
      See also Annex C: Compatibility Properties.

   2. Non-spacing marks are never divided from their base characters,
      and otherwise ignored in locating boundaries.

Level 2 provides more general support for word boundaries between
arbitrary Unicode characters which may override this behavior.

Level 1 support should certainly be met.
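
To make the difference concrete, here is a minimal Java sketch of the two
definitions: the ASCII class quoted above and a rough reading of RL1.4. It
is an illustration only, not how java.util.regex works internally. The
class WordBoundarySketch, the boundaries method and both predicates are
hypothetical names. Keeping digits and underscore as word characters in the
Unicode predicate is an assumption carried over from the ASCII class, and
Character.isAlphabetic needs Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

class WordBoundarySketch {

    // The ASCII word class from the quoted documentation: [A-Za-z_0-9].
    static final IntPredicate ASCII_WORD = cp ->
            (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')
            || (cp >= '0' && cp <= '9') || cp == '_';

    // Assumed reading of RL1.4 item 1: every Alphabetic code point is a
    // word character; digits and underscore are kept as an assumption.
    static final IntPredicate UNICODE_WORD = cp ->
            Character.isAlphabetic(cp) || Character.isDigit(cp) || cp == '_';

    // Returns the char offsets at which a word boundary falls. When
    // attachMarks is true, a non-spacing mark is never split from its base
    // and is otherwise ignored (RL1.4 item 2).
    static List<Integer> boundaries(String s, IntPredicate isWord,
                                    boolean attachMarks) {
        List<Integer> out = new ArrayList<>();
        boolean prevWord = false; // the position before the text is non-word
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            boolean isMark =
                    Character.getType(cp) == Character.NON_SPACING_MARK;
            if (!(attachMarks && isMark && i > 0)) {
                boolean curWord = isWord.test(cp);
                if (curWord != prevWord) {
                    out.add(i); // word/non-word status changes here
                }
                prevWord = curWord;
            }
            i += Character.charCount(cp);
        }
        if (prevWord) {
            out.add(s.length()); // text ends inside a word
        }
        return out;
    }

    public static void main(String[] args) {
        // Two Hindi words separated by a space, written with escapes so the
        // example does not depend on the source file encoding.
        String hindi = "\u0928\u092E\u0938\u094D\u0924\u0947 "
                     + "\u0926\u0941\u0928\u093F\u092F\u093E";
        System.out.println(boundaries(hindi, ASCII_WORD, false));
        System.out.println(boundaries(hindi, UNICODE_WORD, true));
    }
}
```

On the Hindi sample the ASCII class reports no boundaries at all, while the
Unicode-aware predicate reports them at the start and end of each of the
two words.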

-- 
Gunther Schadow, M.D., Ph.D.                  
gschadow(_at_)regenstrief(_dot_)org
Associate Professor           Indiana University School of Informatics
Regenstrief Institute, Inc.      Indiana University School of Medicine
tel:1(317)630-7960                       http://aurora.regenstrief.org

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--