Re: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8

2016-10-11 14:55:05
Hi Bridger,

You may be able to use xsl:character-map to map characters that are not 
transforming correctly into their proper Unicode code points.

I’ve seen plenty of instances where the input files declare one character 
encoding but actually contain characters with a different encoding. If this is 
what you’re facing, it can be helpful to start by doing an analysis of 
character occurrences in the set of input files. You can eliminate characters 
in the ISO646-US range straight off, then eliminate other character codes that 
transform correctly, and then focus on creating a mapping for the remaining 
character codes or character code sequences.
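
As a rough illustration of that analysis step, here is a minimal Python sketch. Python (rather than XSLT or Perl) is my choice for the example, and the names `census`, `census_files` and `format_report` are made up for it. It reads the input as raw bytes, ignoring whatever encoding the XML declaration claims, and counts each run of bytes outside the ASCII range, so that both single stray bytes and multi-byte sequences show up in the tally.

```python
import re
from collections import Counter
from pathlib import Path

# A "run" is one or more consecutive bytes outside the 7-bit ASCII range.
NON_ASCII_RUN = re.compile(rb"[\x80-\xff]+")


def census(data: bytes) -> Counter:
    """Count each distinct run of non-ASCII bytes in one chunk of raw data."""
    return Counter(NON_ASCII_RUN.findall(data))


def census_files(paths) -> Counter:
    """Add up the counts over several files, read as raw bytes."""
    total = Counter()
    for path in paths:
        total += census(Path(path).read_bytes())
    return total


def format_report(counts: Counter) -> list:
    """Return lines such as 'e2 80 99  x3', most frequent run first."""
    return [f"{run.hex(' ')}  x{n}" for run, n in counts.most_common()]
```

Printing `format_report(census_files(paths))` for the input set gives a short list to work through. In a file labelled ISO-8859-1, a lone `92` would suggest Windows-1252 punctuation, while `e2 80 99` would suggest UTF-8 bytes; either way, those are the entries that need a mapping.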

Some Perl modules that can be helpful when dealing with unexpected character 
encodings are Encoding::FixLatin, Encode::Guess, and Text::FixEOL.
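
For anyone who would rather see the general idea than install a module, the following Python sketch works in a similar spirit. It is not a port of any of the modules above, and it rests on an assumption that may not hold for your files: that the unexpected bytes are either well-formed UTF-8 sequences or Windows-1252 characters. The function name `repair` and its helpers are hypothetical.

```python
import re

# One candidate multi-byte UTF-8 sequence (2 to 4 bytes), or else any single
# byte outside the ASCII range. ASCII bytes are never matched and pass through.
_TOKEN = re.compile(
    rb"[\xc2-\xdf][\x80-\xbf]"
    rb"|[\xe0-\xef][\x80-\xbf]{2}"
    rb"|[\xf0-\xf4][\x80-\xbf]{3}"
    rb"|[\x80-\xff]"
)


def _fix_byte(value: int) -> str:
    """Read one high byte as Windows-1252, or as ISO-8859-1 if undefined there."""
    raw = bytes([value])
    try:
        return raw.decode("cp1252")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def _fix_token(match) -> bytes:
    """Keep a token that is valid UTF-8; otherwise reinterpret it byte by byte."""
    raw = match.group(0)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = "".join(_fix_byte(b) for b in raw)
    return text.encode("utf-8")


def repair(data: bytes) -> str:
    """Return the text of mixed-encoding input under the assumption above."""
    return _TOKEN.sub(_fix_token, data).decode("utf-8")
```

This is a heuristic: ISO-8859-1 text that happens to contain a byte pair which is also valid UTF-8 will be read as UTF-8. If the repaired text is written back out as UTF-8, the encoding named in the XML declaration needs to be changed to match.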

Cheers,
Vincent


From: Bridger Dyson-Smith bdysonsmith(_at_)gmail(_dot_)com 
[mailto:xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com]
Sent: Tuesday, October 11, 2016 3:09 PM
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8

Hi all,

I'm struggling with a character encoding issue (or a character representation 
issue maybe?): I have input XML that looks like this

input.xml
<?xml version="1.0" encoding="iso-8859-1"?>
<documents>
            <document>The reality of the effect of natural ventilation in a 
residential attic cavity has been the topic of many debates and scholarly 
reports since the 1930’s.</document>
</documents>

and I would like to get it to a point where the characters are represented 
properly, i.e.

output.xml
<?xml version="1.0" encoding="UTF-8"?>
<documents>
            <document>The reality of the effect of natural ventilation in a 
residential attic cavity has been the topic of many debates and scholarly 
reports since the 1930’s.</document>
</documents>

Thanks to Liam's help on irc and reading through the list archives, it seems 
like an identity transform should be the right step towards getting the 
representation corrected, but something isn't working (or I have a 
misunderstanding somewhere).

If I apply the following identity transform with Saxon HE 9.6.0.7 in oXygen 18:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">
    <xsl:output encoding="UTF-8" indent="yes"/>
    <xsl:template match="/">
        <xsl:copy-of select="/"/>
    </xsl:template>
</xsl:stylesheet>

I get the following result:
<?xml version="1.0" encoding="UTF-8"?>
<documents>
             <document>The reality of the effect of natural ventilation in a 
residential attic cavity has been the topic of many debates and scholarly 
reports since the 1930’s.</document>
</documents>

Could someone provide some insight into what I've done wrong here? Any help 
would be greatly appreciated.

Best,
Bridger

--~----------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
EasyUnsubscribe: http://lists.mulberrytech.com/unsub/xsl-list/1167547
or by email: xsl-list-unsub(_at_)lists(_dot_)mulberrytech(_dot_)com
--~--