
Re: [xsl] How to copy attribute value to text? (Suspected bug involving supplementary characters)

2016-07-07 13:54:37




From: Kenneth Reid Beesley <krbeesley(_at_)gmail(_dot_)com>
Subject: Re: [XSL-List: The Open Forum on XSL] Digest for 2016-07-06
Date: July 7, 2016 at 12:43:54 PM EDT
To: "XSL-List: The Open Forum on XSL" 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com>


Many thanks to Martin Honnen for his response, quoted below.  My further 
comments follow it (suspected bug in Saxon).


On 7Jul2016, at 05:28, XSL-List: The Open Forum on XSL 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

From: Martin Honnen <martin(_dot_)honnen(_at_)gmx(_dot_)de>
Subject: Re: [xsl] How to copy attribute value to text?
Date: 7 July 2016 at 00:43:37 MDT
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com


On 07.07.2016 07:22, Kenneth Reid Beesley 
<krbeesley(_at_)gmail(_dot_)com> wrote:
If I start with an input XML document that contains mixed text with <word> 
elements like this:

     … this is just <word correction="too">to</word> funny

I’d like to write an XSLT stylesheet that yields as output

     … this is just <word origerror="to">too</word> funny

So in the output I effectively want (in the same <word> element) to

     1.  Set the value of a new attribute to the original text() value, and
     2.  Reset the text() value to be the value of the original @correction 
attribute

I’ve tried many variants of the following, so far without success.  I’m 
using SaxonHE9-7-0-6J;
it runs, but the results are not as expected/hoped.

I’ve tried matching the text() in a separate template, but I can’t seem to 
reference the attributes of the text() node’s parent (i.e., <word>).  E.g., 
the following doesn’t work for me, failing somehow in the 
select="../@correction" reference.

<xsl:template match="word[@correction]/text()">
     <xsl:value-of select="../@correction"/>
</xsl:template>


You can use

      <xsl:template match="@* | node()">
              <xsl:copy>
                      <xsl:apply-templates select="@* | node()"/>
              </xsl:copy>
      </xsl:template>
      
      <xsl:template match="word[@correction]/text()">
              <xsl:value-of select="../@correction"/>
      </xsl:template>
      
      <xsl:template match="word/@correction">
              <xsl:attribute name="origerror" select=".."/>
      </xsl:template>
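
For anyone trying this out, the three templates above need a stylesheet 
wrapper.  A minimal sketch of the complete stylesheet, assuming an XSLT 2.0 
processor (the select attribute on xsl:attribute is not available in XSLT 
1.0).  The version attribute, the XML declaration and the comments are 
assumptions added for illustration, not part of the reply above:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy every node and attribute that has no
       more specific template below -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text of a corrected word: emit the correction instead -->
  <xsl:template match="word[@correction]/text()">
    <xsl:value-of select="../@correction"/>
  </xsl:template>

  <!-- The correction attribute: replace it with an origerror attribute
       holding the string value of the parent word element -->
  <xsl:template match="word/@correction">
    <xsl:attribute name="origerror" select=".."/>
  </xsl:template>

</xsl:stylesheet>

The two specific templates take precedence over the identity template 
because their match patterns have a higher default priority.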

Your solution looks perfect and appears to work correctly for ASCII-only XML 
input such as the following:

<?xml version="1.0" encoding="UTF-8"?>

<foo>
  <bar>this is just <word correction="too">to</word> funny</bar>
</foo>

yielding the correct/desired output

<?xml version="1.0" encoding="UTF-8"?>
<foo>
  <bar>this is just <word origerror="to">too</word> funny</bar>
</foo>


I now see that some of my own attempts also worked, on the same ASCII-based 
example.

*****  Suspected bug involving supplementary characters *****

But my real task involves an input XML document, in UTF-8 encoding, that 
consists of Deseret Alphabet characters, which are encoded in the 
supplementary planes (U+10400 to U+1044F, outside the Basic Multilingual 
Plane).  In such a case, the text content of the output <word> element, copied 
from the original attribute value, is corrupted.  I saw such corruption in my 
own attempts, and couldn’t understand what was happening.

Using the following input document (the Deseret Alphabet characters may not 
display correctly for you)

<?xml version="1.0" encoding="UTF-8"?>

<foo>
  <bar>𐑄𐐮𐑅 𐐮𐑆 𐐾𐐲𐑅𐐻 <word correction="𐐻𐐭">𐑂𐐯𐑉𐐮</word> 𐑁𐐲𐑌𐐮</bar>
</foo>

the output, using your script, is corrupted.  The text() value in the output 
(𐐻𐐻𐐭) is not the same as the original @correction value (𐐻𐐭).  Extra 
characters are inserted: just one in this case, a repeat of the first 
character.  The longer the original attribute value, the more extra characters 
are inserted.

<?xml version="1.0" encoding="UTF-8"?>
<foo>
  <bar>𐑄𐐮𐑅 𐐮𐑆 𐐾𐐲𐑅𐐻 <word origerror="𐑂𐐯𐑉𐐮">𐐻𐐻𐐭</word> 𐑁𐐲𐑌𐐮</bar>
</foo>
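
One way to narrow this down would be to print the code points of each 
<word> element's attributes and text, which sidesteps any font or terminal 
display issue.  A minimal sketch, assuming an XSLT 2.0 processor 
(string-to-codepoints() is an XPath 2.0 function).  This diagnostic 
stylesheet is illustrative only and is not taken from the thread.  It can be 
run on both the input and the output document:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Plain-text report, so no markup or escaping gets in the way -->
  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:template match="/">
    <xsl:for-each select="//word">
      <!-- One line per attribute: its name, then its code points -->
      <xsl:for-each select="@*">
        <xsl:value-of select="name()"/>
        <xsl:text>: </xsl:text>
        <xsl:value-of select="string-to-codepoints(.)"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
      <!-- One line for the code points of the element's string value -->
      <xsl:text>text: </xsl:text>
      <xsl:value-of select="string-to-codepoints(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>

Each Deseret character should show up as one decimal value between 66560 and 
66639 (U+10400 to U+1044F).  An extra value on the text line for the output 
document would confirm that the duplicated character is really present in the 
result rather than being a display problem.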

This kind of corruption is exactly what I was seeing using my own scripts, 
leading me to bother the group.  

I suspect a bug in the XSLT engine involving supplementary characters.  Again, 
I’m using SaxonHE9-7-0-6J.

What’s my next step?

Thanks,

Ken

********************************
Kenneth R. Beesley, D.Phil.
PO Box 540475
North Salt Lake UT 84054
USA


