RE: [xsl] XSLT 2.0 : Unicode hex notation in regular expressions

The CJKCompatibility block covers the codepoint range x3300-x33FF only. I
would imagine that to match Japanese language characters you are looking for
a much larger range than this.

If the range of codepoints you want to match doesn't correspond to one of
the named blocks you can always write, for example [&_#x3000;-&_#xFE4F;]
(without the underscores).
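
As a rough sketch of how that range might slot into the stylesheet quoted
below (untested; the <words> wrapper, the normalize-space() call and the
xsl:value-of are my own additions, and the x3000-xFE4F range is only the
illustrative one above, which also takes in Chinese and Korean characters
and so is not specific to Japanese):

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
  <xsl:template match="/text">
    <!-- Hypothetical wrapper element so the result has a single root -->
    <words>
      <!-- normalize-space() trims the string so tokenize() yields no empty tokens -->
      <xsl:for-each select="tokenize(normalize-space(.), '\s+')">
        <word>
          <xsl:attribute name="language">
            <xsl:choose>
              <!-- Explicit code point range instead of a named block;
                   matches() succeeds if any character of the token is in range -->
              <xsl:when test="matches(., '[&#x3000;-&#xFE4F;]')">Japanese</xsl:when>
              <xsl:when test="matches(., '\p{IsBasicLatin}')">Latin</xsl:when>
              <xsl:otherwise>Unknown</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
          <!-- Copy the token itself into the output -->
          <xsl:value-of select="."/>
        </word>
      </xsl:for-each>
    </words>
  </xsl:template>
</xsl:stylesheet>
```

If you would rather stay with named blocks, a class such as
[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}] is another option,
though you would need to check that it covers everything in your data.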

Michael Kay
http://www.saxonica.com/

-----Original Message-----
From: jbesch(_at_)cas(_dot_)org [mailto:jbesch(_at_)cas(_dot_)org] 
Sent: 12 June 2006 20:26
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Cc: jbesch(_at_)cas(_dot_)org
Subject: Re: [xsl] XSLT 2.0 : Unicode hex notation in regular 
expressions

How, for example, to use a useful syntax like
  matches(.,'\p{Script:Arabic}+') ?

schema-2 says: http://www.w3.org/TR/xmlschema-2/#regexs

[Definition:] [Unicode Database] groups code points into a number of 
blocks such as Basic Latin (i.e., ASCII), Latin-1 Supplement, Hangul 
Jamo, CJK Compatibility, etc. The set containing all characters that 
have block name X (with all white space stripped out), can be 
identified with a block escape \p{IsX}. The complement of this set is
specified with the block escape \P{IsX}. ([\P{IsX}] = [^\p{IsX}]).
...
For example, the "block escape" for identifying the ASCII characters is
\p{IsBasicLatin}.


so that would be \p{IsArabic}

David




I want to use the above construct to detect Japanese 
characters, and so I am using the following xsl:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output method="xml" indent="yes" encoding="UTF-8" />
     <xsl:template match="/text">
        <xsl:for-each select="tokenize(.,'\s+')">
          <word>
            <xsl:attribute name="language">
              <xsl:choose>
                 <xsl:when 
test="matches(.,'\p{IsCJKCompatibility}+')">Japanese</xsl:when>
                 <xsl:when 
test="matches(.,'\p{IsBasicLatin}+')">Latin</xsl:when>
                 <xsl:otherwise>Unknown</xsl:otherwise>
              </xsl:choose>
            </xsl:attribute>
          </word>
        </xsl:for-each>
     </xsl:template>
</xsl:stylesheet>

However, the Japanese characters in my input, which are 
encoded in UTF-8, come out flagged as Latin or Unknown.  What 
am I doing wrong?  How do I get this to recognize the 
Japanese characters?

Thanks for any help you can offer.

John Besch


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


