xsl-list

Re: [xsl] Trouble with special characters

2016-01-25 16:45:50
If you are working in Java, be sure that you specify the encoding
explicitly anywhere you go from bytes to characters (or back), and that
if you generate XML with an encoding declaration, the declaration
matches the encoding you are actually writing.
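
A minimal Java sketch of that advice, not taken from the original code: the class name MatchingDeclaration and the element name temp are made up for illustration. The point is only that the encoding declaration and the Writer are driven by the same Charset object, so they cannot drift apart.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: names are illustrative only.
final class MatchingDeclaration {

    // Writes a tiny XML document. The declared encoding and the encoding
    // used by the Writer both come from the same Charset argument.
    static byte[] writeXml(Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>\n");
            w.write("<temp>25\u00B0</temp>\n"); // U+00B0 DEGREE SIGN
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = writeXml(StandardCharsets.ISO_8859_1);
        byte[] utf8 = writeXml(StandardCharsets.UTF_8);
        // Decode each result with the charset it was written in.
        System.out.print(new String(latin1, StandardCharsets.ISO_8859_1));
        System.out.print(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

When the output comes from an XSLT processor instead, handing it an OutputStream rather than a Writer lets the serializer apply the encoding named in xsl:output itself.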

Cheers,

Eliot
----
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com




On 1/25/16, 2:48 PM, "a kusa akusa8(_at_)gmail(_dot_)com"
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

Thanks a lot for taking the time to explain this issue in detail. So I
will go back and try to debug the Java code and see if the encoding is
set correctly there.



On Mon, Jan 25, 2016 at 1:35 PM, Eliot Kimber ekimber(_at_)contrext(_dot_)com
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:
For a situation like this you have to look closely at the chain of
custody of the data as it comes in and out of different tools--any
component that touches it has the opportunity to mess things up.

As others have pointed out, if the data coming in is correct then the
data going out as produced directly by Saxon should be correct as well.
That is, the mapping from Unicode characters to ISO-8859-1 should be
handled correctly by the serializer Saxon is using.

The "gibberish" you're showing is the three bytes of the UTF-8 encoded
"REPLACEMENT CHARACTER" interpreted as individual characters. The UTF-8
encoding of this character, Unicode code point U+FFFD, is 0xEF 0xBF
0xBD. In ISO-8859-1, character 0xEF (239) is i-umlaut, 0xBF (191) is
inverted question mark, and 0xBD (189) is the 1/2 fraction. Thus your
gibberish.
(http://www.fileformat.info/info/unicode/char/0fffd/index.htm)
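
The byte arithmetic above can be checked with a few lines of Java. This is only an illustrative sketch; the class and method names are invented here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: names are illustrative only.
final class ReplacementCharBytes {

    // UTF-8 bytes of U+FFFD REPLACEMENT CHARACTER.
    static byte[] utf8OfReplacementChar() {
        return "\uFFFD".getBytes(StandardCharsets.UTF_8);
    }

    // Decodes one byte per character, the way an ISO-8859-1 reader would.
    static String readAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] bytes = utf8OfReplacementChar();
        for (byte b : bytes) {
            System.out.printf("0x%02X ", b & 0xFF);
        }
        System.out.println();
        // Print the code point and Unicode name of each misread character.
        for (char c : readAsLatin1(bytes).toCharArray()) {
            System.out.printf("U+%04X %s%n", (int) c, Character.getName(c));
        }
    }
}
```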

So the following is happening somewhere in your tool chain:

1. Something is not recognizing the character you think should be a
degree symbol as a known Unicode character and is replacing it with the
Unicode replacement character (U+FFFD).

2. Something is then reading the UTF-8 bytes resulting from (1) as a
single-byte encoding such as ISO-8859-1 rather than as UTF-8, treating
each of the three bytes of the replacement character as an individual
character.

3. The remaining stages don't know any better and continue to treat the
characters as characters, resulting in the three characters i-umlaut,
inverted question mark, 1/2 fraction in the output.

I think the most likely thing is that something is reading the incoming
ISO-8859-1 data as UTF-8, not recognizing the byte 0xB0 (the degree
symbol in ISO-8859-1) as a character (because a lone 0xB0 is not a valid
UTF-8 sequence), and replacing it with the Unicode replacement
character.

Something then reads the resulting byte sequence as ISO-8859-1 rather
than UTF-8 but then generates UTF-8 output (otherwise the byte sequence
would be the same on input and output), resulting in the gibberish.
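
The suspected chain can be reproduced in isolation. The sketch below is a guess at the mechanism, not the actual failing code: it assumes the input is "25" followed by the ISO-8859-1 degree byte, and the class name GibberishChain is made up. It relies on the fact that Java's String constructor substitutes U+FFFD for malformed input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical reproduction of the suspected failure; names are illustrative.
final class GibberishChain {

    static String simulate(byte[] latin1Input) {
        // Wrong decode: ISO-8859-1 bytes read as UTF-8. A lone 0xB0 is
        // malformed UTF-8, so Java substitutes U+FFFD for it.
        String damaged = new String(latin1Input, StandardCharsets.UTF_8);

        // The damaged text is written out as UTF-8 (U+FFFD -> EF BF BD).
        byte[] written = damaged.getBytes(StandardCharsets.UTF_8);

        // Second wrong decode: the UTF-8 bytes are read one byte per
        // character, as ISO-8859-1.
        return new String(written, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] input = { 0x32, 0x35, (byte) 0xB0 }; // "25" + degree sign in ISO-8859-1
        String result = simulate(input);
        for (char c : result.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Once the first decode has happened the original degree sign is gone; nothing downstream can recover it, so the fix has to be made at the point where the bytes are first read.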

Some tools write XML in one encoding but put in a different encoding
declaration, e.g., a file is written as ISO-8859-1 but with a UTF-8
encoding declaration. This would lead to the behavior we're seeing here,
where the degree symbol should be encoded as two UTF-8 bytes (0xC2 0xB0)
but is output as a single ISO-8859-1 byte (0xB0).
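
One way to catch this kind of mismatch is to decode the bytes strictly and see whether they are valid in the declared encoding. A rough sketch follows, with an invented class name (Utf8Check); it uses a CharsetDecoder set to report errors instead of silently substituting U+FFFD.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper; names are illustrative only.
final class Utf8Check {

    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] asUtf8 = "\u00B0".getBytes(StandardCharsets.UTF_8);        // C2 B0
        byte[] asLatin1 = "\u00B0".getBytes(StandardCharsets.ISO_8859_1); // B0
        System.out.println(asUtf8.length + " byte(s), valid UTF-8: " + isValidUtf8(asUtf8));
        System.out.println(asLatin1.length + " byte(s), valid UTF-8: " + isValidUtf8(asLatin1));
    }
}
```

A check like this only proves that bytes are not UTF-8; bytes that pass can still be mislabeled if they happen to be pure ASCII.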

Using Java it's easy to forget to specify the encoding when writing a
byte sequence using a Writer or when constructing a String instance.
This will result in the bytes being written or read in the default
encoding for the system running the application, which is often *not* a
Unicode encoding. Other languages have similar pitfalls.
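
As a hedged illustration of the pitfall, the sketch below names the charset on both the encoding and the decoding side, so the result does not depend on the platform default. The class and method names are invented; the comments point out the no-argument calls that would fall back to the default.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative only.
final class ExplicitCharsets {

    // Use getBytes(Charset); the no-argument text.getBytes() would use
    // the platform default encoding.
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Pass the charset to InputStreamReader; the one-argument constructor
    // would use the platform default encoding.
    static String decode(byte[] bytes, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Platform default: " + Charset.defaultCharset());
        String text = "25\u00B0";
        byte[] bytes = encode(text, StandardCharsets.ISO_8859_1);
        System.out.println(text.equals(decode(bytes, StandardCharsets.ISO_8859_1)));
    }
}
```

Note that from JDK 18 onward the default charset is UTF-8 on all platforms; on older JDKs it follows the operating system locale, which is where the surprises tend to come from.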

I find the free Windows tool Unipad to be invaluable when trying to
track down this type of encoding problem--it does a good job of guessing
the real encoding and also has tools for converting between many
encodings, inspecting files in uncommon encodings, and so on. However,
oXygenXML has a lot of good tools for this now, so I depend on Unipad
less than I used to 10 years ago. (http://www.unipad.org/main/)

Good luck.

Cheers,

Eliot

----
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com




On 1/25/16, 12:36 PM, "a kusa akusa8(_at_)gmail(_dot_)com"
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

The transformed XML itself has the gibberish value for the degree
symbol. So it displays as question marks in IE.

There is a Java program that uses the transformation factory to
convert the XML. I view the results in XML Spy.

On Mon, Jan 25, 2016 at 12:17 PM, Martin Honnen 
martin(_dot_)honnen(_at_)gmx(_dot_)de
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:
a kusa akusa8(_at_)gmail(_dot_)com wrote:

And you have <xsl:output omit-xml-declaration="no"/> as well? Does the
result have an XML declaration? - Yes, there is an XML declaration.

Does XML Spy indicate the encoding used to display the file? - Not sure
where to see this. The transformed XML has the encoding set to
ISO-8859-1.


What happens when you load the XML result into a browser like IE or
Firefox? Are the characters displayed as you want them?

As for using Saxon, how do you use it? Do you run it from the command
line yourself, with the -o:result.xml output option? Or is XML Spy
running Saxon and maybe not doing it right?






