xsl-list

Re: [xsl] Invalid byte 2 of 2-byte UTF-8 sequence exception while transforming

2007-02-02 05:25:04
Pankaj Bishnoi wrote:
> Hi Owen,
> The file starts with <?xml version="1.0" encoding="ISO-8859-1"?>, so I
> think that before transforming, the encoding of the file is changed to
> UTF-8 (the default encoding for the Xalan transformer). Since a UTF-8
> encoded file cannot contain ISO-8859-1 characters, this might be the
> cause of this problem. I am still debugging it.

No, UTF-8 is an encoding for Unicode, which can handle all characters from ISO-8859-1.
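To make that concrete, here is a minimal Python sketch (Python is my choice for illustration, it is not part of your Xalan setup). It shows that every ISO-8859-1 character survives a round trip through UTF-8, and that the trouble only starts when ISO-8859-1 bytes are read as if they were UTF-8. The sample string and the helper name decodes_as_utf8 are made up for the example.

```python
# Every ISO-8859-1 code point (0-255) is also a Unicode code point,
# so each one can be written as UTF-8 and read back unchanged.
for cp in range(256):
    ch = bytes([cp]).decode("iso-8859-1")
    assert ch.encode("utf-8").decode("utf-8") == ch

# Hypothetical sample text. In ISO-8859-1 the capital E-acute is the
# single byte 0xC9, which a UTF-8 decoder treats as the first byte of
# a 2-byte sequence.
sample = "CAF\u00c9 MENU"
latin1_bytes = sample.encode("iso-8859-1")
utf8_bytes = sample.encode("utf-8")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(utf8_bytes))    # same text, saved as UTF-8
print(decodes_as_utf8(latin1_bytes))  # same text, saved as ISO-8859-1
```

The second call fails because the byte after 0xC9 is a plain space rather than a continuation byte, which is the same kind of situation your parser is complaining about.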

If you use Eclipse, you can test the "looks" of your file as follows:

1. Open the XML file as-is.
2. Right-click the file in the Navigator and click Properties
3. Check the "Default (determined from content: ISO-8859-1)" option, that is, check what it says there; it should show "ISO-8859-1".
4. Read through your file carefully to see if there are any small squares (Eclipse's way of showing unknown chars, chars not in the font, or chars that are illegal). If there are some, your file contains illegal encodings.
5. It may be that, as a result of illegal characters, Xalan tries to read the file as UTF-8 (because that is the default for XML). ISO-8859-1 and UTF-8 are not the same for characters above codepoint 127, and for these characters it may give this error.
6. Go to the Properties again and manually type "UTF-8". Check again for any little squares.
7. Make a little change, and change the encoding string to "UTF-8". Eclipse will now automatically and correctly save the file as UTF-8. Then change it back to ISO-8859-1. Eclipse will replace any character that is not allowed in ISO-8859-1 with a "?" char. Close and reopen the file to see if it has any such changed chars.
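If you prefer to check this outside Eclipse, the two things the steps above look for can be approximated with a few lines of Python. This is only a sketch under my own assumptions: the function names are invented for the example, and the "?" substitution merely imitates the behaviour described in step 7, it is not Eclipse's actual code.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset where UTF-8 decoding fails, or None if it is valid."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte of the bad sequence
        return err.start


def to_latin1_lossy(text: str) -> bytes:
    """Encode as ISO-8859-1, replacing characters it cannot represent with '?'."""
    return text.encode("iso-8859-1", errors="replace")


# Hypothetical data: ISO-8859-1 bytes that are not valid UTF-8.
print(first_invalid_utf8_offset(b"abc\xc9 def"))
# The euro sign (U+20AC) does not exist in ISO-8859-1.
print(to_latin1_lossy("a\u20acb"))
```

To run it against your real file, read the file in binary mode and pass the bytes to first_invalid_utf8_offset; the offset tells you where to start looking.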

If you don't have Eclipse, you can use a text editor where you can select and override the encoding. Even a browser will give you some hints on illegal characters when you select another encoding using the View menu. If you have an editor where you can search with regular expressions, search your document with the following expression (or the equivalent for your regex dialect):

[^\t\n\r\x20-\x7F]+

It will give you all "character suspects" that may have gotten the wrong encoding when the file was saved. In fact, it gives you all characters that are not allowed in XML if you were to encode your file as US-ASCII (one of the most basic character sets; its 128 codepoints, 0 through 127, are identical in all ISO-8859-X sets, in UTF-8 and in many other character sets). By testing these suspects one by one (removing or changing them), you will quickly find the problem character.
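If your editor has no regex search, the same hunt can be done with a short script. The sketch below is Python and works on the raw bytes of the file, so it does not depend on guessing the encoding first. The upper bound \x7F is my assumption (it covers the whole 7-bit range), and the function name find_suspects and the sample data are invented for the example.

```python
import re

# Runs of bytes that are neither tab, LF, CR nor in the 7-bit range 0x20-0x7F.
SUSPECT = re.compile(rb"[^\t\n\r\x20-\x7F]+")


def find_suspects(data: bytes):
    """Return (line, column, raw_bytes) for each run of suspect bytes (1-based)."""
    results = []
    for m in SUSPECT.finditer(data):
        line = data.count(b"\n", 0, m.start()) + 1
        line_start = data.rfind(b"\n", 0, m.start()) + 1
        column = m.start() - line_start + 1
        results.append((line, column, m.group()))
    return results


# Hypothetical sample: one ISO-8859-1 e-acute (byte 0xE9) on the second line.
sample = b"<a>\n<b>caf\xe9</b>\n</a>"
for line, column, raw in find_suspects(sample):
    print(line, column, raw)
```

For a real file, read it with open(path, "rb").read() and pass the result in; each reported position is a candidate to inspect by hand.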

Good luck researching!

Cheers,
-- Abel Braaksma
  http://www.nuntia.nl

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--