Fw: [xsl] decoding percent-escaped octet sequences

Trying to send again, this time not as UTF-8 email ...

----- Forwarded by Hermann Stamm-Wilbrandt/Germany/IBM on 05/23/2011 11:47 
AM -----

From:   Hermann Stamm-Wilbrandt/Germany/IBM
To:     xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Date:   05/23/2011 10:37 AM
Subject:        Re: [xsl] decoding percent-escaped octet sequences

DataPower provides a convert-http action for processing HTTP form 
submissions, which are non-XML.
At the time this entered the product (before the acquisition by IBM in 
2005), the default encoding for URL-encoded strings was ISO-8859-1.

The equivalent of the convert-http action for use inside DataPower 
stylesheets is the dp:decode() extension function:
http://publib.boulder.ibm.com/infocenter/wsdatap/v3r8m2/index.jsp?topic=/xa35/extensionfunctions41.htm

Last year a customer asked for a way to handle UTF-8 URL-encoded URIs 
(because Google returns those to them).

I provided an implementation for that in a technote and a Webcast:
http://www-01.ibm.com/support/docview.wss?uid=swg21412370
http://www-01.ibm.com/support/docview.wss?uid=swg27019118&aid=1#page=15

This implementation is based on the EXSLT extension function 
str:decode-uri() (DataPower is an XSLT 1.0 processor).
http://exslt.org/str/functions/decode-uri/index.html

I modified the stylesheet from the technote to eliminate the access to 
"dp:variable()".
This way it even works with xsltproc, see below.

$ xsltproc utf8uriDemo.xsl utf8uriDemo.xsl 
<?xml version="1.0"?>
<request xmlns:uri="http://uri"><url>/utf8uri?danish=%C3%86-%C3%98-%C3%85&amp;french=%C5%92-%C3%A6&amp;german=%C3%84-%C3%96-%C3%9C-%C3%9F&amp;spanish=%CA%A7-%EA%9D%86-%C3%91</url><base-url>/utf8uri</base-url><args src="url"><arg name="danish">Æ-Ø-Å</arg><arg name="french">Œ-æ</arg><arg name="german">Ä-Ö-Ü-ß</arg><arg name="spanish">ʧ-Ꝇ-Ñ</arg></args></request>
$ 
$ cat utf8uriDemo.xsl 
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:str="http://exslt.org/strings"
  xmlns:uri="http://uri"
  exclude-result-prefixes="str"
>
  <xsl:template match="/">
    <xsl:variable 
name="url"><![CDATA[/utf8uri?danish=%C3%86-%C3%98-%C3%85&french=%C5%92-%C3%A6&german=%C3%84-%C3%96-%C3%9C-%C3%9F&spanish=%CA%A7-%EA%9D%86-%C3%91]]></xsl:variable>

    <request>
      <url><xsl:copy-of select="$url"/></url>
      <base-url>
        <xsl:copy-of select="substring-before($url,'?')"/>
      </base-url>
      <args src="url">
        <xsl:for-each 
          select="str:tokenize(substring-after($url,'?'),'&amp;')">
          <xsl:element name="arg">
            <xsl:attribute name="name">
              <xsl:value-of select="substring-before(.,'=')"/>
            </xsl:attribute> 
            <xsl:value-of 
              select="str:decode-uri(substring-after(.,'='))"/> 
          </xsl:element>
        </xsl:for-each>
      </args>
    </request>
  </xsl:template>
 
</xsl:stylesheet>
$ 
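
The decoded values can also be cross-checked outside XSLT. The following 
is a minimal Python sketch, not part of the original post, that splits the 
same demo URL and decodes the argument values as UTF-8 using only the 
standard library's urllib.parse. The function name split_and_decode is 
made up for illustration. Note that parse_qsl additionally treats "+" as a 
space, which does not matter for this URL.

```python
from urllib.parse import urlsplit, parse_qsl

# Same demo URL as in the stylesheet's "url" variable.
URL = ("/utf8uri?danish=%C3%86-%C3%98-%C3%85&french=%C5%92-%C3%A6"
       "&german=%C3%84-%C3%96-%C3%9C-%C3%9F&spanish=%CA%A7-%EA%9D%86-%C3%91")


def split_and_decode(url):
    """Return (base_url, [(name, value), ...]) with values decoded as UTF-8."""
    parts = urlsplit(url)
    # errors="strict" makes a non-UTF-8 escape raise instead of being replaced.
    args = parse_qsl(parts.query, encoding="utf-8", errors="strict")
    return parts.path, args


base_url, args = split_and_decode(URL)
# base_url is "/utf8uri"; args is a list of four (name, value) pairs:
# danish  -> U+00C6 - U+00D8 - U+00C5
# french  -> U+0152 - U+00E6
# german  -> U+00C4 - U+00D6 - U+00DC - U+00DF
# spanish -> U+02A7 - U+A746 - U+00D1
```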


Mit besten Gruessen / Best wishes,

Hermann Stamm-Wilbrandt
Developer, XML Compiler, L3
Fixpack team lead
WebSphere DataPower SOA Appliances
https://www.ibm.com/developerworks/mydeveloperworks/blogs/HermannSW/
----------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294 



From:   Chris Maloney <voldrani(_at_)gmail(_dot_)com>
To:     xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Cc:     Brandon Ibach <brandon(_dot_)ibach(_at_)single-sourcing(_dot_)com>
Date:   05/20/2011 07:22 PM
Subject:        Re: [xsl] decoding percent-escaped octet sequences



On Fri, May 20, 2011 at 12:14 PM, Julian Reschke 
<julian(_dot_)reschke(_at_)gmx(_dot_)de> 
wrote:

On 2011-05-20 17:52, Brandon Ibach wrote:

Generally, when you're doing string manipulations inside XSLT/XPath,
there really is no such thing as ISO-8859-1, UTF-8 or any other
encoding, since the "string" data type in XPath is just a string of
Unicode characters.


But Julian is right that a percent-encoded string, which represents a
byte sequence, can be considered to be encoded in one or another way.
I investigated this same kind of thing for the site I work on, and I
don't have a solution for how to convert these to strings inside XSLT,
but I thought I'd just paste some of the test cases I worked with, in
case they might prove interesting or useful.

1. UTF-8 encoded single character
A. ?term=%C3%84rzteblatt
"Ärzteblatt"

2. Invalid character codes (ASCII control character(s), but not valid
ISO-8859-1 or UTF-8)
A. ?term=%02%03cat

3. Non UTF-8, ISO-8859-1, single character
A. ?term=%C4rzteblatt
"Ärzteblatt"

4. Invalid byte sequence (not valid utf-8 or iso-8859-1)
A. ?term=%C4%83%C4cat

5. Chinese characters, UTF-8 encoded
A. ?term=%e4%bd%a0%e5%a5%bd
Search box: "你好"

6. ISO-8859-1 multi-byte - this sequence starts out looking like UTF-8,
but it's not.
A. ?term=%c4%A0%c4rzteblatt
Search box: "Ä Ärzteblatt"
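
To make the ambiguity in the cases above concrete, here is a small Python 
sketch (an illustration added here, not code from the site in question) 
that decodes each "term" value twice: once as UTF-8, with invalid bytes 
replaced by U+FFFD, and once as ISO-8859-1. The helper name both_readings 
is hypothetical. It uses urllib.parse.unquote from the standard library.

```python
from urllib.parse import unquote

# The "term" values from test cases 1 to 6, in order.
CASES = [
    "%C3%84rzteblatt",
    "%02%03cat",
    "%C4rzteblatt",
    "%C4%83%C4cat",
    "%e4%bd%a0%e5%a5%bd",
    "%c4%A0%c4rzteblatt",
]


def both_readings(term):
    """Decode one percent-escaped value as lenient UTF-8 and as ISO-8859-1."""
    # Bytes that are not valid UTF-8 become U+FFFD instead of raising.
    as_utf8 = unquote(term, encoding="utf-8", errors="replace")
    # Every byte value maps to a character in ISO-8859-1, so this never fails.
    as_latin1 = unquote(term, encoding="iso-8859-1")
    return as_utf8, as_latin1


readings = [both_readings(term) for term in CASES]
```

ISO-8859-1 decoding never fails, so it cannot by itself tell the two 
encodings apart. Only the UTF-8 reading can signal a problem, as it does 
(with U+FFFD) for cases 3, 4 and 6.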


After working with this for a while, we reached the conclusion that
it's best to try to strictly enforce the rule that percent-encoding in
URLs be UTF-8.  In other words, I think it's a bad idea to try to
continue to maintain ISO-8859-1 encoded URLs, because it just leads to
too many possible problems, that are very hard to debug.
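
One way to enforce such a rule is to decode strictly and reject anything 
that is not valid UTF-8, rather than falling back to ISO-8859-1. The 
sketch below is an assumption about how that could look in Python, not a 
description of what was actually deployed. The function names are 
hypothetical.

```python
from urllib.parse import unquote_to_bytes


def decode_utf8_only(term):
    """Percent-decode term, accepting the bytes only if they are valid UTF-8."""
    raw = unquote_to_bytes(term)
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("not UTF-8 percent-encoding: %r" % term) from exc


def is_utf8_escaped(term):
    """Return True if term percent-decodes to valid UTF-8, else False."""
    try:
        decode_utf8_only(term)
    except ValueError:
        return False
    return True
```

With this check, test cases 1 and 5 are accepted and cases 3, 4 and 6 are 
rejected. The control characters of case 2 decode without error, so 
rejecting them would need a separate rule.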

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--



