Hi Gerrit,
On Tue, Oct 11, 2016 at 3:29 PM, Imsieke, Gerrit, le-tex
gerrit(_dot_)imsieke(_at_)le-tex(_dot_)de
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:
But do we know that the characters are just bytes?
Sometimes UTF-8 is being read as if it were ISO-8859-1 or CP-1252 (which
is more likely on Windows) and then saved as UTF-8. Then â€™ are 3
(multibyte) UTF-8 characters.
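
A minimal Python sketch of the double encoding described above. Python is not used anywhere in this thread, the variable names are illustrative only, and the sketch assumes the misreading codec is CP-1252:

```python
# One character, stored correctly as three UTF-8 bytes.
original = "\u2019"                      # RIGHT SINGLE QUOTATION MARK
utf8_bytes = original.encode("utf-8")    # b'\xe2\x80\x99'

# Misreading those bytes as CP-1252 turns each byte into its own character.
misread = utf8_bytes.decode("cp1252")    # U+00E2, U+20AC, U+2122

# Saving the misread text as UTF-8 makes every one of them multibyte.
double_encoded = misread.encode("utf-8")

print(len(misread), [len(ch.encode("utf-8")) for ch in misread])  # 3 [2, 3, 3]
```

So a file damaged this way holds eight bytes where the original held three.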
This is very similar to some of the advice that Liam shared with me; i.e.
something from a Windows server (I'm fairly sure that's the OS for the
application generating the $input.xml files) is reading UTF-8 and outputting
it as ISO-8859-1.
If this is the case, you can correct it with
iconv -t WINDOWS-1252 -f UTF-8 input.xml | sed -e 's/ encoding="iso-8859-1"/ encoding="UTF-8"/' > output.xml
:) now *this* is different. This replaces the ISO/CP-1252/... with U+FFFD,
which is arguably an improvement.
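
For readers without iconv and sed, the same two steps can be sketched in Python. This is only a rough equivalent under the assumption that the file really is double-encoded (UTF-8 misread as CP-1252, then saved as UTF-8); `repair_double_encoding` is a hypothetical helper name, not something from this thread:

```python
def repair_double_encoding(data: bytes) -> bytes:
    """Undo one round of 'UTF-8 read as CP-1252, then saved as UTF-8'."""
    text = data.decode("utf-8")
    # Counterpart of `iconv -f UTF-8 -t WINDOWS-1252`; raises
    # UnicodeEncodeError if a character has no CP-1252 equivalent.
    raw = text.encode("cp1252")
    # Counterpart of the sed step: fix the encoding declaration.
    return raw.replace(b' encoding="iso-8859-1"', b' encoding="UTF-8"', 1)


# Illustrative input: the mojibake is written with escapes to keep it unambiguous.
sample = (
    '<?xml version="1.0" encoding="iso-8859-1"?>\n'
    "<document>1930\u00e2\u20ac\u2122s</document>"
).encode("utf-8")

fixed = repair_double_encoding(sample)
print(fixed.decode("utf-8"))
```

If the file is not double-encoded in this way, the encode step can fail or produce bytes that are no longer valid UTF-8, so the result should be checked before overwriting anything.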
Gerrit
Bridger
On 11.10.2016 21:23, Wolfgang Laun wolfgang(_dot_)laun(_at_)gmail(_dot_)com
wrote:
The characters E2 80 99 are the UTF-8 encoding of the Unicode character
RIGHT SINGLE QUOTATION MARK.
Simply changing the ISO-8859-1 in your XML file to UTF-8 should fix this.
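
The byte claim above is easy to check. A short Python sketch (Python is an assumption here, not something used in the thread) decodes the same three bytes both ways:

```python
import unicodedata

raw = bytes.fromhex("e28099")

# Decoded as UTF-8, the three bytes form a single character.
as_utf8 = raw.decode("utf-8")
print(f"U+{ord(as_utf8):04X}", unicodedata.name(as_utf8))
# U+2019 RIGHT SINGLE QUOTATION MARK

# Decoded as ISO-8859-1, as the XML declaration requests, each byte
# becomes a separate character, two of them C1 control characters.
as_latin1 = raw.decode("iso-8859-1")
print([f"U+{ord(ch):04X}" for ch in as_latin1])
# ['U+00E2', 'U+0080', 'U+0099']
```

This is why the declared encoding matters: the bytes in the file stay the same, and only the parser's interpretation of them changes.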
On 11 October 2016 at 21:00, Bridger Dyson-Smith
bdysonsmith(_at_)gmail(_dot_)com
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:
Hi all,
I'm struggling with a character encoding issue (or a character
representation issue maybe?): I have input XML that looks like this
input.xml
<?xml version="1.0" encoding="iso-8859-1"?>
<documents>
<document>The reality of the effect of natural ventilation in a
residential attic cavity has been the topic of many debates and
scholarly reports since the 1930â€™s.</document>
</documents>
and I would like to get it to a point where the characters are
represented properly, i.e.
output.xml
<?xml version="1.0" encoding="UTF-8"?>
<documents>
<document>The reality of the effect of natural ventilation in a
residential attic cavity has been the topic of many debates and
scholarly reports since the 1930’s.</document>
</documents>
Thanks to Liam's help on irc and reading through the list archives,
it seems like an identity transform should be the right step towards
getting the representation corrected, but something isn't working
(or I have a misunderstanding somewhere).
If I apply the following identity transform with Saxon HE 9.6.0.7 in
oXygen 18:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:output encoding="UTF-8" indent="yes"/>
<xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
</xsl:stylesheet>
I get the following result:
<?xml version="1.0" encoding="UTF-8"?>
<documents>
<document>The reality of the effect of natural ventilation in a
residential attic cavity has been the topic of many debates and
scholarly reports since the 1930â€™s.</document>
</documents>
Could someone provide some insight into what I've done wrong here?
Any help would be greatly appreciated.
Best,
Bridger
XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
EasyUnsubscribe <-list/528976> (by email)
--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit(_dot_)imsieke(_at_)le-tex(_dot_)de, http://www.le-tex.de
Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930
Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
Thomas Schmidt, Dr. Reinhard Vöckler
------------------------------------------------------------------------------
Meet us at Frankfurt Book Fair:
Hall 4.2, Stand L68.
More info at http://www.le-tex.de/en/buchmesse.html
--~----------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
EasyUnsubscribe: http://lists.mulberrytech.com/unsub/xsl-list/1167547
or by email: xsl-list-unsub(_at_)lists(_dot_)mulberrytech(_dot_)com
--~--