Re: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8

2016-10-11 15:19:46
Yes, then you need a more general solution--basically all your data has been
corrupted by reading UTF-8 files as though they were in a single-byte
encoding (ISO-8859-1 or similar) and then saving the result as UTF-8, as
Wolfgang surmised.

There must be a general way to undo this corruption, but I don't know of an
existing tool that would do it. Basically you would need to scan the document
text nodes for sequences of characters that, when interpreted as single
bytes, would represent the UTF-8 encoding of a Unicode character. I suspect
that's actually not that hard, but it's not a puzzle I can attempt at the
moment. We know, for example, that "â" corresponds to the first byte of a
three-byte UTF-8 sequence, so searching for that and then doing something
with the two characters following would do it. There is probably a simple
mathematical relation between the mangled Unicode characters and the bytes in
the UTF-8 encoding of the original character.
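
A minimal sketch of that scan, written in Python rather than XSLT. It assumes
the mangling came from decoding UTF-8 bytes as Windows-1252 (an assumption the
thread does not confirm, although the \u20AC and \u2122 in the example are
consistent with it). The function names are illustrative only.

```python
def _to_byte(ch):
    """Return the single byte that ch would stand for, or None."""
    try:
        return ch.encode("cp1252")[0]
    except UnicodeEncodeError:
        # cp1252 leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined; assume
        # such bytes came through unchanged as C1 control characters.
        cp = ord(ch)
        return cp if 0x80 <= cp <= 0x9F else None


def repair(text):
    """Replace runs of characters that spell out a valid UTF-8 sequence."""
    out = []
    i = 0
    while i < len(text):
        fixed = None
        n = 1
        lead = _to_byte(text[i])
        if lead is not None and 0xC2 <= lead <= 0xF4:
            # The lead byte tells us how long the UTF-8 sequence should be.
            n = 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
            raw = [_to_byte(c) for c in text[i:i + n]]
            if len(raw) == n and None not in raw:
                try:
                    fixed = bytes(raw).decode("utf-8")
                except UnicodeDecodeError:
                    fixed = None  # not a valid sequence, leave it alone
        if fixed is None:
            out.append(text[i])
            i += 1
        else:
            out.append(fixed)
            i += n
    return "".join(out)


print(repair("since the 1930\u00e2\u20ac\u2122s"))
```

Text that legitimately contains a sequence such as "Ã©" would also be
rewritten, so the result is worth spot-checking before it replaces the data.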

Looking at the UTF-8 encoding, the bytes for \u2019 (right single quotation
mark) are xE2 x80 x99.

The corresponding mangled characters are: \u00E2 \u20AC \u2122

I don't see an obvious mathematical transform there but I'm also recovering
from jet lag and not at my sharpest just now.
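
One possible reading of the relation, sketched in Python: if the bytes were
decoded with the Windows-1252 code page rather than strict ISO-8859-1, the
transform is a table lookup, not arithmetic. Windows-1252 maps x80 to \u20AC
and x99 to \u2122, which matches the mangled characters above. That the files
went through Windows-1252 is an assumption based only on this one example.

```python
original = "\u2019"

# The UTF-8 encoding of U+2019 is the three bytes E2 80 99.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as Windows-1252 yields the mangled characters.
mangled = utf8_bytes.decode("cp1252")

# Strict ISO-8859-1 would instead give U+00E2 U+0080 U+0099.
latin1 = utf8_bytes.decode("latin-1")

# Reversing the Windows-1252 step recovers the original character.
restored = mangled.encode("cp1252").decode("utf-8")

print([hex(b) for b in utf8_bytes])
print([hex(ord(c)) for c in mangled])
print(restored == original)
```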

Maybe somebody else sees a way to do this generally?

One thing to try would be to simply list out all the bad character sequences
to see what there is--more than one example may suggest a pattern.
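
A rough Python sketch of such a listing, again assuming Windows-1252 was the
code page involved. It counts every run of characters that spells out a valid
UTF-8 sequence and pairs it with the character it would decode to. The names
`SUSPECT` and `list_suspects` are illustrative.

```python
import re
from collections import Counter


def _cp1252_char(byte):
    """Character a Windows-1252 decoder would produce for one byte."""
    try:
        return bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        # Five bytes are undefined in cp1252; assume they came through
        # unchanged as C1 control characters.
        return chr(byte)


# Characters that can stand for a UTF-8 continuation byte (0x80-0xBF).
_CONT = "[" + "".join(re.escape(_cp1252_char(b)) for b in range(0x80, 0xC0)) + "]"

SUSPECT = re.compile(
    "[\u00c2-\u00df]" + _CONT             # two-byte sequences
    + "|[\u00e0-\u00ef]" + _CONT + "{2}"  # three-byte sequences
    + "|[\u00f0-\u00f4]" + _CONT + "{3}"  # four-byte sequences
)


def _run_to_bytes(run):
    """Map each character of a suspect run back to the byte it stands for."""
    return bytes(ord(c) if ord(c) < 0x100 else c.encode("cp1252")[0] for c in run)


def list_suspects(text):
    """Count (mangled run, decoded character) pairs found in text."""
    counts = Counter()
    for match in SUSPECT.finditer(text):
        run = match.group(0)
        try:
            decoded = _run_to_bytes(run).decode("utf-8")
        except UnicodeDecodeError:
            continue  # looks like a sequence but is not valid UTF-8
        counts[(run, decoded)] += 1
    return counts


sample = "the 1930\u00e2\u20ac\u2122s and the 1940\u00e2\u20ac\u2122s, caf\u00c3\u00a9"
for (run, decoded), n in list_suspects(sample).most_common():
    print(repr(run), "->", repr(decoded), n)
```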

You may find that there are few enough you can just make a brute force
replacement transform.
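
As an illustration of the brute-force route, the following Python sketch
builds the replacement table from a short list of characters one expects in
the data, then applies plain string replacement. The list in `EXPECTED` is a
made-up example, and Windows-1252 as the offending code page is an
assumption. The same table could be turned into a series of replace() calls
in a stylesheet.

```python
def mangle(ch):
    """Show how ch looks after its UTF-8 bytes are decoded as Windows-1252."""
    out = []
    for b in ch.encode("utf-8"):
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # Byte undefined in cp1252; assume it survived as a C1 control.
            out.append(chr(b))
    return "".join(out)


# Hypothetical list of characters expected in the documents.
EXPECTED = "\u2018\u2019\u201c\u201d\u2013\u2014\u2026\u00e9"

# Mangled sequence -> intended character.
TABLE = {mangle(ch): ch for ch in EXPECTED}


def brute_force_fix(text):
    """Replace every known mangled sequence with its intended character."""
    # Longest sequences first, so a short entry cannot split a longer one.
    for bad in sorted(TABLE, key=len, reverse=True):
        text = text.replace(bad, TABLE[bad])
    return text


print(brute_force_fix("since the 1930\u00e2\u20ac\u2122s"))
```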

Cheers,

E.
--
Eliot Kimber
http://contrext.com
 


From:  "Bridger Dyson-Smith bdysonsmith(_at_)gmail(_dot_)com"
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com>
Reply-To:  <xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com>
Date:  Tuesday, October 11, 2016 at 3:55 PM
To:  <xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com>
Subject:  Re: [xsl] Character encoding/representation from ISO-8859-1 to
UTF-8

Hi Eliot

On Tue, Oct 11, 2016 at 3:36 PM, Eliot Kimber ekimber(_at_)contrext(_dot_)com
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:
The characters are not just the ASCII bytes.

I think you will need to match on the characters in question and replace them
with the desired character, e.g.:

<!-- Replace the mangled sequence with the intended right single quote -->
<xsl:template match="text()[contains(., 'â€™')]">
  <xsl:value-of select="replace(., 'â€™', '’')"/>
</xsl:template>

And then use a more complete identity transform that handles the text nodes:

Thank you for the response. I'm afraid I'm guilty of providing an incomplete
picture of my issue: I'm not sure which malformed(?) characters are in the
input documents. My apologies for leaving that detail out, but it seems like
it would present a fairly significant problem for doing a replace().
 
Cheers,

Eliot

Again, thanks for your time and trouble.
Bridger 

--
Eliot Kimber
http://contrext.com
 


From:  "Bridger Dyson-Smith bdysonsmith(_at_)gmail(_dot_)com"
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com>
Reply-To:  <xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com>
Date:  Tuesday, October 11, 2016 at 2:59 PM
To:  <xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com>
Subject:  [xsl] Character encoding/representation from ISO-8859-1 to UTF-8

<?xml version="1.0" encoding="iso-8859-1"?>
<documents>
<document>The reality of the effect of natural ventilation in a residential
attic cavity has been the topic of many debates and scholarly reports since
the 1930’s.</document>
</documents>
XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
EasyUnsubscribe <-list/1230532> (by email)
