Abel Braaksma wrote:
Hi
> I know that control characters are not allowed and throw
> an "Invalid XML character" error. Also, adding very
> large numbers (like "1234567") gives a plural of the same
> error (I'm not sure why). Some characters (I believe these
> are the ones that are not assigned in Unicode) result in
> an empty string (like "12345").
> Is there a robust way of allowing/disallowing a set of
> codepoints (other than making one huge lookup list)?
Technically, it is not complex. Just define a function
my:codepoints-to-string() that makes the needed checks and
does what you want when encountering an invalid codepoint. I
think the most difficult part is identifying which
codepoints are valid. You can use the following from the
XML recommendation as a starting point:
    /* any Unicode character, excluding the surrogate
       blocks, FFFE, and FFFF. */
    [2] Char ::= #x9
               | #xA
               | #xD
               | [#x20-#xD7FF]
               | [#xE000-#xFFFD]
               | [#x10000-#x10FFFF]
Document authors are encouraged to avoid "compatibility
characters", as defined in section 6.8 of [Unicode] (see
also D21 in section 3.6 of [Unicode3]). The characters
defined in the following ranges are also
discouraged. They are either control characters or
permanently undefined Unicode characters:
[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
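The discouraged ranges quoted above do not need a lookup list either. Apart from the first three ranges, they are simply the last two codepoints of each of planes 1 to 16, so a modulo test covers them. The following is only a sketch under that reading of the list: the function name my:is-discouraged-codepoint, the "main" template, and the namespace URI bound to "my" are hypothetical names chosen for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: true if $cp falls in one of the ranges
       listed as discouraged in the quoted text. -->
  <xsl:function name="my:is-discouraged-codepoint" as="xs:boolean">
    <xsl:param name="cp" as="xs:integer"/>
    <xsl:sequence select="
        ($cp ge 127 and $cp le 132)         (: #x7F-#x84 :)
        or ($cp ge 134 and $cp le 159)      (: #x86-#x9F :)
        or ($cp ge 64976 and $cp le 64991)  (: #xFDD0-#xFDDF :)
        (: #xnFFFE-#xnFFFF for planes 1 to 16 :)
        or ($cp ge 65536 and $cp le 1114111
            and ($cp mod 65536) ge 65534)"/>
  </xsl:function>

  <!-- Entry point: list the sample codepoints that are discouraged. -->
  <xsl:template name="main">
    <xsl:value-of separator=" " select="
        (65, 127, 133, 159, 64976, 131070, 1114111)
          [my:is-discouraged-codepoint(.)]"/>
  </xsl:template>

</xsl:stylesheet>
```

Started at the "main" template, this should output "127 159 64976 131070 1114111". Codepoint 133 (#x85) is not in the list, which is why the first two ranges are kept separate.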
When you have identified the valid codepoints, you
will have to choose what to do with the invalid ones.
For example, call codepoints-to-string() for valid
codepoints, and return the empty sequence or the empty
string for invalid ones:
    <!-- True if $down <= $value <= $up. -->
    <xsl:function name="my:is-in-range" as="xs:boolean">
      <xsl:param name="value" as="xs:integer"/>
      <xsl:param name="down" as="xs:integer"/>
      <xsl:param name="up" as="xs:integer"/>
      <xsl:sequence select="$value ge $down and $value le $up"/>
    </xsl:function>

    <!-- True if $cp matches the Char production above
         (decimal equivalents of the hexadecimal bounds). -->
    <xsl:function name="my:is-valid-codepoint" as="xs:boolean">
      <xsl:param name="cp" as="xs:integer"/>
      <xsl:sequence select="
          $cp = (9, 10, 13)
          or my:is-in-range($cp, 32, 55295)
          or my:is-in-range($cp, 57344, 65533)
          or my:is-in-range($cp, 65536, 1114111)"/>
    </xsl:function>

    <!-- Returns the character, or the empty sequence if $cp is invalid. -->
    <xsl:function name="my:codepoint-to-string" as="xs:string?">
      <xsl:param name="cp" as="xs:integer"/>
      <xsl:if test="my:is-valid-codepoint($cp)">
        <xsl:sequence select="codepoints-to-string($cp)"/>
      </xsl:if>
    </xsl:function>
or instead the following, depending on your needs:
    <xsl:function name="my:codepoints-to-string" as="xs:string">
      <xsl:param name="cp" as="xs:integer*"/>
      <xsl:sequence select="
          codepoints-to-string($cp[my:is-valid-codepoint(.)])"/>
    </xsl:function>
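To try the filtering variant in isolation, here is a minimal, self-contained stylesheet sketch. It inlines the range checks so that it does not depend on any other function. The namespace URI bound to "my" and the "main" template name are assumptions made for the example, not anything required by XSLT.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- True if $cp matches the Char production of XML 1.0. -->
  <xsl:function name="my:is-valid-codepoint" as="xs:boolean">
    <xsl:param name="cp" as="xs:integer"/>
    <xsl:sequence select="
        $cp = (9, 10, 13)
        or ($cp ge 32 and $cp le 55295)
        or ($cp ge 57344 and $cp le 65533)
        or ($cp ge 65536 and $cp le 1114111)"/>
  </xsl:function>

  <!-- Silently drops invalid codepoints, then builds the string. -->
  <xsl:function name="my:codepoints-to-string" as="xs:string">
    <xsl:param name="cp" as="xs:integer*"/>
    <xsl:sequence select="
        codepoints-to-string($cp[my:is-valid-codepoint(.)])"/>
  </xsl:function>

  <!-- Entry point: 1 is a control character and 1234567 is
       above #x10FFFF, so both are filtered out. -->
  <xsl:template name="main">
    <xsl:value-of select="
        my:codepoints-to-string((72, 1, 105, 1234567, 33))"/>
  </xsl:template>

</xsl:stylesheet>
```

Run with an XSLT 2.0 processor that can start at a named template (for example Saxon's -it:main option), this should output "Hi!". The value 1234567 is dropped because it exceeds 1114111 (#x10FFFF), the highest Unicode codepoint, which is presumably why it triggers an error when passed to codepoints-to-string() directly.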
Regards,
--drkm