xsl-list

XSLT 2.0 : Unicode hex notation in regular expressions

2004-08-12 02:38:08
Hi,

I don't know if my XSLT syntax is wrong or if it is a Saxon-related problem. Let's blame the XSLT writer rather than the XSLT processor first ;-)

Given the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<text>livre : كتاب</text>

And the following XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/text">
    <xsl:comment><xsl:value-of select="system-property('xsl:vendor')" /></xsl:comment>
    <words>
      <xsl:for-each select="tokenize(.,'\s+')">
        <word>
          <xsl:attribute name="language">
            <xsl:choose>
              <xsl:when test="matches(.,'[a-z]+')">latin</xsl:when>
              <xsl:when test="matches(.,'[\\u0600-\\u06FF]+')">arabic</xsl:when>
              <xsl:otherwise>whatever</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
          <xsl:attribute name="codepoints"><xsl:value-of select="string-to-codepoints(.)"/></xsl:attribute>
          <xsl:value-of select="."/>
        </word>
      </xsl:for-each>
    </words>
  </xsl:template>
</xsl:stylesheet>

I get:

<?xml version="1.0" encoding="UTF-8"?>
<!--SAXON 8.0 from Saxonica-->
<words>
  <word language="latin" codepoints="108 105 118 114 101">livre</word>
  <word language="arabic" codepoints="58">:</word>
  <word language="whatever" codepoints="1603 1578 1575 1576">كتاب</word>
</words>

Why this curious match for codepoint 58? And why is there no match for the Arabic characters?

BTW, I first tried matches(.,'[\u0600-\u06FF]+'), as described at http://www.unicode.org/reports/tr18/#Hex_notation

But Saxon returned the following error:

Error at xsl:when on line 11 of file:/C:/...:
net.sf.saxon.type.RegexTranslator$RegexSyntaxException: Error at character 2 in regular expression: bad escape sequence

That's why I doubled the "\" character. Is this doubling spec-compliant?
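
One variant I could try, as a minimal sketch: it assumes that XPath 2.0 regular expressions follow the XML Schema regex syntax, which has no \u escape, so the two code points are written as XML character references that the XML parser expands before the regex is compiled. Only the second xsl:when differs from the stylesheet above; the vendor comment and the codepoints attribute are dropped for brevity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/text">
    <words>
      <!-- Split the text node on runs of whitespace -->
      <xsl:for-each select="tokenize(., '\s+')">
        <word>
          <xsl:attribute name="language">
            <xsl:choose>
              <xsl:when test="matches(., '[a-z]+')">latin</xsl:when>
              <!-- Assumption: the character references are resolved by the
                   XML parser, so the regex engine receives the literal
                   characters U+0600 and U+06FF as the ends of the range -->
              <xsl:when test="matches(., '[&#x0600;-&#x06FF;]+')">arabic</xsl:when>
              <xsl:otherwise>whatever</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
          <xsl:value-of select="."/>
        </word>
      </xsl:for-each>
    </words>
  </xsl:template>
</xsl:stylesheet>
```

Under the same assumption, '[\\u0600-\\u06FF]' would be read as a class of literal characters (a backslash, 'u', '0', '6', 'F') plus the range from '0' (codepoint 48) to '\' (codepoint 92). Codepoint 58 falls inside that range, which may be why ':' was tagged as arabic.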

Cheers,

p.b.