How, for example, would one use a syntax like
matches(.,'\p{Script:Arabic}+')?
XML Schema Part 2 says (http://www.w3.org/TR/xmlschema-2/#regexs):
[Definition:] [Unicode Database] groups code points into a number of
blocks such as Basic Latin (i.e., ASCII), Latin-1 Supplement, Hangul
Jamo, CJK Compatibility, etc. The set containing all characters that
have block name X (with all white space stripped out), can be identified
with a block escape \p{IsX}. The complement of this set is specified
with the block escape \P{IsX}. ([\P{IsX}] = [^\p{IsX}]).
...
For example,
the ·block escape· for identifying the ASCII characters is \p{IsBasicLatin}.
So that would be \p{IsArabic}
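A minimal sketch of that block escape in use, assuming an XSLT 2.0 processor. The input root element <text> and the two result strings are made up for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Assumption: the input document has a root element named <text>. -->
  <xsl:template match="/text">
    <!-- \p{IsArabic} is the block escape for the Arabic block (U+0600 to U+06FF).
         matches() is unanchored, so one Arabic character anywhere is enough. -->
    <xsl:value-of select="if (matches(., '\p{IsArabic}'))
                          then 'contains Arabic'
                          else 'no Arabic'"/>
  </xsl:template>
</xsl:stylesheet>
```

To require that the whole string lies in the block, anchor the pattern: '^\p{IsArabic}+$'.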
David
I want to use the above construct to detect Japanese characters, so I am
using the following XSL:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8" />
  <xsl:template match="/text">
    <xsl:for-each select="tokenize(.,'\s+')">
      <word>
        <xsl:attribute name="language">
          <xsl:choose>
            <xsl:when test="matches(.,'\p{IsCJKCompatibility}+')">Japanese</xsl:when>
            <xsl:when test="matches(.,'\p{IsBasicLatin}+')">Latin</xsl:when>
            <xsl:otherwise>Unknown</xsl:otherwise>
          </xsl:choose>
        </xsl:attribute>
      </word>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
However, the Japanese characters in my input, which are encoded in UTF-8,
come out flagged as Latin or Unknown. What am I doing wrong? How do I get
this to recognize the Japanese characters?
Thanks for any help you can offer.
John Besch
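One possible direction, offered as a sketch rather than a definitive fix: the CJK Compatibility block (U+3300 to U+33FF) holds squared abbreviations and unit symbols, not ordinary Japanese text, which normally falls in the Hiragana, Katakana and CJK Unified Ideographs blocks. The variant below tests those three blocks instead. The <words> wrapper element, the normalize-space() call, the anchored Latin pattern and the echoing of each token are additions for illustration and are not part of the stylesheet above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
  <xsl:template match="/text">
    <!-- Hypothetical wrapper so the result has a single root element. -->
    <words>
      <!-- normalize-space() avoids empty tokens from leading or trailing whitespace. -->
      <xsl:for-each select="tokenize(normalize-space(.), ' ')">
        <word>
          <xsl:attribute name="language">
            <xsl:choose>
              <!-- Any kana or ideograph in the token flags it as Japanese. -->
              <xsl:when test="matches(., '[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]')">Japanese</xsl:when>
              <!-- Anchored, so the whole token must be ASCII. -->
              <xsl:when test="matches(., '^\p{IsBasicLatin}+$')">Latin</xsl:when>
              <xsl:otherwise>Unknown</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
          <xsl:value-of select="."/>
        </word>
      </xsl:for-each>
    </words>
  </xsl:template>
</xsl:stylesheet>
```

Note that the ideograph block is shared with Chinese, so a token made only of kanji cannot be told apart from Chinese by block tests alone. Also, because matches() is unanchored, the original '\p{IsBasicLatin}+' test succeeds for any token containing at least one ASCII character.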