xsl-list

RE: [xsl] Safe-guarding codepoints-to-string() from wrong input

2006-12-20 08:20:17
There's no obvious way of doing this within the language, other than
defining a function that knows which codepoints are valid characters.
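
A minimal sketch of such a function, assuming XSLT 2.0 and XML 1.0. The ranges are those of the Char production in the XML 1.0 Recommendation; the my: namespace and the names my:is-xml10-char and my:decode are made up for illustration and do not come from any library. Very long digit strings rely on the processor's xs:integer handling arbitrary size.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:codepoints"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical helper: true if $cp matches the XML 1.0 Char production,
       #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] -->
  <xsl:function name="my:is-xml10-char" as="xs:boolean">
    <xsl:param name="cp" as="xs:integer"/>
    <xsl:sequence select="
        $cp = (9, 10, 13)
        or ($cp ge 32 and $cp le 55295)
        or ($cp ge 57344 and $cp le 65533)
        or ($cp ge 65536 and $cp le 1114111)"/>
  </xsl:function>

  <!-- Hypothetical helper: replace each [n] with the character for
       codepoint n, leaving the bracketed text untouched when n is
       not a legal XML 1.0 character -->
  <xsl:function name="my:decode" as="xs:string">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="parts" as="xs:string*">
      <xsl:analyze-string select="$input" regex="\[(\d+)\]">
        <xsl:matching-substring>
          <xsl:variable name="cp" select="xs:integer(regex-group(1))"/>
          <xsl:sequence select="if (my:is-xml10-char($cp))
                                then codepoints-to-string($cp)
                                else ."/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:sequence select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <xsl:sequence select="string-join($parts, '')"/>
  </xsl:function>

</xsl:stylesheet>
```

With this in place, my:decode('some [34]quoted[34] string [1]') should return the text some "quoted" string [1], passing the illegal [1] through rather than raising an error. Note that the Char production only says which codepoints are legal in XML; it does not test whether a codepoint is assigned in Unicode.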

In Saxon, there's an internal method which should be easy enough to call as
an extension function:

<xsl:if test="nc:isXML11Valid($codepoint)"
xmlns:nc="java:net.sf.saxon.om.XML11Char">

or

<xsl:if test="nc:isXML10Valid($codepoint)"
xmlns:nc="java:net.sf.saxon.om.XML10Char">

depending on which version of XML you are using.

You could of course run this on all the possible codepoints to generate a
lookup file: you'll want to use keys to make the lookup efficient.
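
The lookup side might be sketched as follows, assuming XSLT 2.0 and a file generated beforehand. The file name valid-codepoints.xml, its <cp n="..."/> layout, the key name valid-cp and the function my:is-listed are all assumptions made for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:codepoints"
    exclude-result-prefixes="xs my">

  <!-- Assumed layout of the lookup file (hypothetical):
         <codepoints>
           <cp n="9"/>
           <cp n="10"/>
           ...
         </codepoints>
       The key indexes each cp element by its number, cast to an
       integer so that it compares equal to an xs:integer argument. -->
  <xsl:key name="valid-cp" match="cp" use="xs:integer(@n)"/>

  <!-- Loaded once; the relative URI is resolved against the stylesheet -->
  <xsl:variable name="lookup" select="document('valid-codepoints.xml')"/>

  <!-- True if the lookup file lists $cp as a valid codepoint -->
  <xsl:function name="my:is-listed" as="xs:boolean">
    <xsl:param name="cp" as="xs:integer"/>
    <xsl:sequence select="exists(key('valid-cp', $cp, $lookup))"/>
  </xsl:function>

</xsl:stylesheet>
```

The three-argument form of key() is what lets the search run against the lookup document rather than the document being transformed.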

Michael Kay
http://www.saxonica.com/

-----Original Message-----
From: Abel Braaksma [mailto:abel(_dot_)online(_at_)xs4all(_dot_)nl] 
Sent: 20 December 2006 14:34
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: [xsl] Safe-guarding codepoints-to-string() from wrong input

Hi all,

In a translation stylesheet, I take user input (an arbitrary 
string) and replace a set of numbers with a set of characters, 
like this:

$input = "some [34]quoted[34] string"
output --> some "quoted" string

<xsl:analyze-string select="$input" regex="\[(\d+)\]">
    <xsl:matching-substring>
        <xsl:value-of
            select="codepoints-to-string(xs:integer(regex-group(1)))" />
    </xsl:matching-substring>
    <xsl:non-matching-substring>
        <xsl:value-of select="." />
    </xsl:non-matching-substring>
</xsl:analyze-string>

Because we are talking about tons of data containing strings 
like the above (in text files), I'd like to make the 
codepoints-to-string() call a bit more rock-solid. In normal 
operation, it fails hard. But I'd like it to degrade 
gracefully: be liberal in what you accept.

I know that control characters are not allowed and throw an 
"Invalid XML character" error. Also, very large numbers 
(like "1234567") give the same error several times over (I'm 
not sure why). Some characters (I believe these are the ones 
that are not assigned in Unicode) result in an empty string 
(like "12345").

Is there a robust way of allowing/disallowing a set of 
codepoints (other than making one huge lookup list)?

Cheers,
Abel







--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


