Re: character encodings

Pam Huntley wrote:

I'm having a problem where my XML file is in utf-8 (and has
English characters in it), but my XSL file has DBCS characters in it, and
although I saved it as UTF-8, I don't really know what the encoding is (I
think for Japanese it's ms_kanji, big5 for Chinese).
[...]


It is the responsibility of the XML document author (whether that XML doc is
an XSLT stylesheet or any other kind of XML) to know what the encoding of
their document is, and to accurately declare it in the encoding declaration
part of the prolog...

<?xml version="1.0" encoding="whatever"?> at the top of your document,
even if it is a stylesheet.

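Getting this right matters for well-formedness. The declaration may be
omitted only if the document is in utf-8 or utf-16, and it is an error to
misdeclare the encoding, though the error is not always detectable, such as
when you have only ASCII characters and you declare any ASCII-compatible
encoding (that is, anything but utf-16 or ebcdic).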
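
To illustrate the "not always detectable" point, here is a minimal sketch in
Python using the standard library's xml.etree.ElementTree parser. Python is
only my choice for illustration; this is not the msxml API, and the helper
name parses_ok is hypothetical. ASCII-only content slips through under a
wrong but ASCII-compatible declaration, while Big5 bytes labeled as utf-8
are rejected.

```python
import xml.etree.ElementTree as ET


def parses_ok(xml_bytes: bytes) -> bool:
    """Return True if the parser accepts the bytes as well-formed XML."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


# ASCII-only content: these bytes are identical in iso-8859-1 and utf-8,
# so a parser has no way to notice if the declaration is wrong.
ascii_doc = b'<?xml version="1.0" encoding="iso-8859-1"?><msg>hello</msg>'

# Two CJK characters (U+4E2D U+6587) encoded as Big5 but declared as utf-8.
# The resulting byte sequence is not valid utf-8.
big5_text = "\u4e2d\u6587".encode("big5")
mislabeled_doc = (
    b'<?xml version="1.0" encoding="utf-8"?><msg>' + big5_text + b"</msg>"
)

# The same characters actually encoded as utf-8, matching the declaration.
correct_doc = (
    '<?xml version="1.0" encoding="utf-8"?><msg>\u4e2d\u6587</msg>'
).encode("utf-8")

print(parses_ok(ascii_doc), parses_ok(mislabeled_doc), parses_ok(correct_doc))
```

The mislabeled case is roughly the situation described in the question: the
bytes and the declaration disagree, so the parser gives up before it ever
sees a document element.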

When I go to transform using the microsoft msxml stuff, I get an error 
saying the XSL does not contain a document element.  However, if I use the 
exact same XSL, only the untranslated version (or any single byte version), 
saved as utf-8, it works.


Right, utf-8 uses 1 to 4 bytes per character in unambiguous sequences, while
these other encodings tend to use 1 or 2 bytes per character, with second
bytes that can overlap the ASCII range, or they stay within 7-bit bytes but
interject escape sequences to "shift" into an alternate "page" in their
character maps, thus requiring stateful decoding algorithms. Either way, you
can't expect an XML parser to know that up until byte x in your file the
encoding is utf-8 and then suddenly it switches to big5.
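
A quick way to see the difference is to look at the bytes. This is a small
Python sketch using only the built-in codecs; the sample characters and
variable names are my own picks for illustration.

```python
# One CJK character, U+4E2D, in two encodings.
zhong = "\u4e2d"
utf8_bytes = zhong.encode("utf-8")   # three bytes, all >= 0x80
big5_bytes = zhong.encode("big5")    # two bytes

# U+529F in Big5: the second byte falls in the ASCII range, so a decoder
# that guesses the wrong encoding sees what looks like a backslash.
gong_big5 = "\u529f".encode("big5")

# ISO-2022-JP is stateful: escape sequences switch character sets, and
# everything in between is 7-bit bytes that also look like ASCII.
jis_bytes = "\u65e5\u672c".encode("iso2022_jp")

for label, data in [
    ("utf-8", utf8_bytes),
    ("big5", big5_bytes),
    ("big5 (U+529F)", gong_big5),
    ("iso-2022-jp", jis_bytes),
]:
    print(label, len(data), data.hex(" "))
```

In utf-8 no byte of a multi-byte character can be mistaken for ASCII, which
is what makes it unambiguous; the other two give a parser no such help.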

I got the strings translated, and they came back in an ANSI file.


By ANSI do you mean windows-1252? I don't see how that could be, because
there are fewer than 256 characters in windows-1252, and none of them are
in CJK scripts. You said you get them as big5 or whatever.

I couldn't send the XSL off to be translated because our translation centers
don't really know what to do with it.  Then I used a program to go replace
the strings back where they belong in the XSL.


Yeah, you can't really do that. You're pasting encoded strings (bytes) into
the middle of a bunch of bytes derived through some other encoding. You can
only do that if your encodings are the same, and even then, it's not an
advisable way to go about things.
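
Here is roughly what goes wrong, as a Python sketch. The template string,
the PLACEHOLDER marker, and the helper is_valid_utf8 are all hypothetical
stand-ins; I don't know what your replacement program actually does.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as utf-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A fragment of a stylesheet, stored as utf-8 bytes.
template = "<xsl:text>PLACEHOLDER</xsl:text>".encode("utf-8")

# The translated string (U+4E2D U+6587) as it came back, in Big5.
translated_big5 = "\u4e2d\u6587".encode("big5")

# Byte-level splice: Big5 bytes dropped into the middle of utf-8 bytes.
spliced = template.replace(b"PLACEHOLDER", translated_big5)

# The same splice when both sides really are utf-8.
translated_utf8 = "\u4e2d\u6587".encode("utf-8")
spliced_same = template.replace(b"PLACEHOLDER", translated_utf8)

print(is_valid_utf8(spliced), is_valid_utf8(spliced_same))
```

Saving the spliced file "as utf-8" afterwards doesn't repair it, because
the damage is already in the bytes.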

So, for single byte
languages, I save the resultant XSL in utf-8 and everyone is happy.   But
for the DBCS languages, even if I save the resulting file in utf-8, I get
the error.

I don't have any control over the XML file - it comes from a server, and I
just save it to a file.  Is there some way to make the XSL work, even if it
is not utf-8?


You really need to know the encoding of what you're getting back. I
don't know the API, exactly, but you use that info to decode all your strings
into Unicode string objects. Then you can stitch them together however you
want, and then encode the entire result as utf-8.
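
As a sketch of that decode, stitch, encode sequence, here it is in Python
rather than whatever API you are using. The function localize_stylesheet,
the @@GREETING@@ placeholder scheme, and the assumption that the
translations arrive as Big5 are all mine, purely for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def localize_stylesheet(template_bytes: bytes,
                        translations: dict,
                        source_encoding: str) -> bytes:
    """Decode everything to Unicode, substitute, then encode once as utf-8."""
    # The template is assumed to be stored as utf-8.
    text = template_bytes.decode("utf-8")
    for placeholder, raw in translations.items():
        # Decode each translated string using its real encoding, and
        # escape &, < and > so the result stays well-formed.
        text = text.replace(placeholder, escape(raw.decode(source_encoding)))
    return text.encode("utf-8")


template_bytes = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
    '<xsl:template match="/"><p>@@GREETING@@</p></xsl:template>'
    '</xsl:stylesheet>'
).encode("utf-8")

# Translated string as delivered, here assumed to be Big5.
translations = {"@@GREETING@@": "\u4e2d\u6587".encode("big5")}

result = localize_stylesheet(template_bytes, translations, "big5")

# The result is one consistent utf-8 document that matches its declaration.
root = ET.fromstring(result)
greeting = root.find(".//p").text
```

The key point is that the substitution happens on decoded strings, never on
raw bytes, so there is only one encoding step and it covers the whole file.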

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list