The CJKCompatibility block covers the codepoint range x3300-x33FF only. I
would imagine that to match Japanese language characters you are looking for
a much larger range than this.
If the range of codepoints you want to match doesn't correspond to one of
the named blocks, you can always write an explicit character range, for
example [&_#x3000;-&_#xFE4F;] (without the underscores).
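As a rough sketch of how that explicit range might slot into the stylesheet
quoted below (untested here; the <words> wrapper, the normalize-space() call,
the anchored Latin test and the xsl:value-of are my own additions for
illustration, not part of the original):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
  <xsl:template match="/text">
    <!-- Hypothetical wrapper element so the result has a single root. -->
    <words>
      <!-- normalize-space() avoids empty tokens from leading or trailing whitespace. -->
      <xsl:for-each select="tokenize(normalize-space(.), '\s+')">
        <word>
          <xsl:attribute name="language">
            <xsl:choose>
              <!-- Broad range U+3000 to U+FE4F: it covers kana and CJK ideographs,
                   but also Hangul and other scripts, so "Japanese" is an approximation. -->
              <xsl:when test="matches(., '[&#x3000;-&#xFE4F;]')">Japanese</xsl:when>
              <!-- Anchored so that only all-ASCII tokens are flagged as Latin. -->
              <xsl:when test="matches(., '^\p{IsBasicLatin}+$')">Latin</xsl:when>
              <xsl:otherwise>Unknown</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
          <xsl:value-of select="."/>
        </word>
      </xsl:for-each>
    </words>
  </xsl:template>
</xsl:stylesheet>
```

If the broad range catches too much, narrower block escapes such as
\p{IsHiragana}, \p{IsKatakana} and \p{IsCJKUnifiedIdeographs} could be tried
in its place.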
Michael Kay
http://www.saxonica.com/
-----Original Message-----
From: jbesch(_at_)cas(_dot_)org [mailto:jbesch(_at_)cas(_dot_)org]
Sent: 12 June 2006 20:26
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Cc: jbesch(_at_)cas(_dot_)org
Subject: Re: [xsl] XSLT 2.0 : Unicode hex notation in regular expressions
How, for example, to use a useful syntax like
matches(.,'\p{Script:Arabic}+') ?
schema-2 says: http://www.w3.org/TR/xmlschema-2/#regexs
[Definition:] [Unicode Database] groups code points into a number of
blocks such as Basic Latin (i.e., ASCII), Latin-1 Supplement, Hangul
Jamo, CJK Compatibility, etc. The set containing all characters that
have block name X (with all white space stripped out), can be
identified with a block escape \p{IsX}. The complement of this set is
specified with the block escape \P{IsX}. ([\P{IsX}] = [^\p{IsX}]).
...
For example, the "block escape" for identifying the ASCII characters is
\p{IsBasicLatin}.
so that would be \p{IsArabic}
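A minimal sketch of both forms of the block escape in use (untested; the
<result>, <arabic> and <non-ascii> element names are made up for the example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <result>
      <!-- true if the document text contains at least one character
           from the Arabic block -->
      <arabic>
        <xsl:value-of select="matches(string(.), '\p{IsArabic}')"/>
      </arabic>
      <!-- true if the text is non-empty and every character lies
           outside the Basic Latin (ASCII) block -->
      <non-ascii>
        <xsl:value-of select="matches(string(.), '^\P{IsBasicLatin}+$')"/>
      </non-ascii>
    </result>
  </xsl:template>
</xsl:stylesheet>
```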
David
I want to use the above construct to detect Japanese
characters, and so I am using the following xsl:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8" />
  <xsl:template match="/text">
    <xsl:for-each select="tokenize(.,'\s+')">
      <word>
        <xsl:attribute name="language">
          <xsl:choose>
            <xsl:when test="matches(.,'\p{IsCJKCompatibility}+')">Japanese</xsl:when>
            <xsl:when test="matches(.,'\p{IsBasicLatin}+')">Latin</xsl:when>
            <xsl:otherwise>Unknown</xsl:otherwise>
          </xsl:choose>
        </xsl:attribute>
      </word>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
However, the Japanese characters in my input, which are
encoded in UTF-8, come out flagged as Latin or Unknown. What
am I doing wrong? How do I get this to recognize the
Japanese characters?
Thanks for any help you can offer.
John Besch
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail:
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--