Re: [xsl] Safe-guarding codepoints-to-string() from wrong input

2006-12-20 08:20:36
Abel Braaksma wrote:

  Hi

I know that control characters are not allowed and throw
an "Invalid XML character" error. Also, very large numbers
(like "1234567") give several instances of the same error
(I'm not sure why). Some characters (I believe these are
the ones that are not assigned in Unicode) result in an
empty string (like "12345").

Is there a robust way of allowing/disallowing a set of
codepoints (other than making one huge lookup list)?

  Technically, it is not complex.  Just define a function
my:codepoints-to-string() that makes the needed checks and
does what you want when encountering an invalid codepoint.
I think the most difficult part is identifying which
codepoints are valid.  You can use the following from the
XML recommendation as a starting point:

    /* any Unicode character, excluding the surrogate
       blocks, FFFE, and FFFF. */
    [2] Char ::= #x9
                 | #xA
                 | #xD
                 | [#x20-#xD7FF]
                 | [#xE000-#xFFFD]
                 | [#x10000-#x10FFFF]

    Document authors are encouraged to avoid "compatibility
    characters", as defined in section 6.8 of [Unicode] (see
    also D21 in section 3.6 of [Unicode3]). The characters
    defined in the following ranges are also
    discouraged. They are either control characters or
    permanently undefined Unicode characters:

    [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
    [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
    [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
    [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
    [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
    [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
    [#x10FFFE-#x10FFFF].
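
  If you also want to filter out the discouraged characters,
the ranges quoted above could be tested along the following
lines.  This is only a sketch: the function name
my:is-discouraged-codepoint is made up for illustration, and
the prefixes my and xs are assumed to be bound in the
stylesheet.  The seventeen "FFFE-FFFF" pairs are folded into
one test, since each is the last two codepoints of a
supplementary plane:

    <!-- True if $cp is in one of the discouraged ranges
         quoted from the XML recommendation (hypothetical
         helper, not part of any library). -->
    <xsl:function name="my:is-discouraged-codepoint" as="xs:boolean">
      <xsl:param name="cp" as="xs:integer"/>
      <xsl:sequence select="
          ($cp ge 127 and $cp le 132)
            or ($cp ge 134 and $cp le 159)
            or ($cp ge 64976 and $cp le 64991)
            or ($cp ge 65536 and $cp le 1114111
                and $cp mod 65536 ge 65534)"/>
    </xsl:function>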

  Once you have identified the valid codepoints, you have
to choose what to do with the invalid ones.  For example,
call codepoints-to-string() for valid codepoints, and
return the empty sequence or the empty string for an
invalid one:

    <xsl:function name="my:is-in-range" as="xs:boolean">
      <xsl:param name="value" as="xs:integer"/>
      <xsl:param name="down"  as="xs:integer"/>
      <xsl:param name="up"    as="xs:integer"/>
      <xsl:sequence select="$value ge $down and $value le $up"/>
    </xsl:function>

    <xsl:function name="my:is-valid-codepoint" as="xs:boolean">
      <xsl:param name="cp" as="xs:integer"/>
      <xsl:sequence select="
          $cp = (9, 10, 13)
            or my:is-in-range($cp,    32,   55295)
            or my:is-in-range($cp, 57344,   65533)
            or my:is-in-range($cp, 65536, 1114111)"/>
    </xsl:function>

    <xsl:function name="my:codepoint-to-string" as="xs:string?">
      <xsl:param name="cp" as="xs:integer"/>
      <xsl:if test="my:is-valid-codepoint($cp)">
        <xsl:sequence select="codepoints-to-string($cp)"/>
      </xsl:if>
    </xsl:function>

or instead the following, depending on your needs:

    <xsl:function name="my:codepoints-to-string" as="xs:string">
      <xsl:param name="cp" as="xs:integer*"/>
      <xsl:sequence select="
          codepoints-to-string($cp[my:is-valid-codepoint(.)])"/>
    </xsl:function>
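
  A third possibility, if dropping characters silently is
not what you want, might be to substitute U+FFFD (the
Unicode replacement character) for each invalid codepoint.
The following is a minimal, self-contained sketch under
some assumptions: the namespace URI bound to my, the
function name my:codepoints-to-string-replacing and the
template name main are all made up for illustration.

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:my="http://example.org/my"
        exclude-result-prefixes="xs my">

      <xsl:output method="text"/>

      <!-- True if $cp matches the Char production of XML 1.0. -->
      <xsl:function name="my:is-valid-codepoint" as="xs:boolean">
        <xsl:param name="cp" as="xs:integer"/>
        <xsl:sequence select="
            $cp = (9, 10, 13)
              or ($cp ge 32 and $cp le 55295)
              or ($cp ge 57344 and $cp le 65533)
              or ($cp ge 65536 and $cp le 1114111)"/>
      </xsl:function>

      <!-- Hypothetical variant: every invalid codepoint is
           replaced by 65533 (U+FFFD) before conversion. -->
      <xsl:function name="my:codepoints-to-string-replacing"
                    as="xs:string">
        <xsl:param name="cps" as="xs:integer*"/>
        <xsl:sequence select="
            codepoints-to-string(
              for $cp in $cps
              return if (my:is-valid-codepoint($cp))
                     then $cp
                     else 65533)"/>
      </xsl:function>

      <!-- Entry point, assumed to be invoked as the initial
           template. -->
      <xsl:template name="main">
        <xsl:value-of select="
            my:codepoints-to-string-replacing((72, 105, 0, 33))"/>
      </xsl:template>

    </xsl:stylesheet>

  Started at the named template main (with Saxon, via the
-it option), this should output "Hi", then U+FFFD in place
of the invalid codepoint 0, then "!".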

  Regards,

--drkm