Re: UTF-8 Byte Order Mark giving problems

Hi Geert,

I'm afraid there is no nicer solution. You can look at XMLEntityManager.setupCurrentEntity() in the Xerces source to see how this is handled in Xerces:


// Special case UTF-8 files with BOM created by Microsoft
// tools. It's more efficient to consume the BOM than make
// the reader perform extra checks. -Ac
if (count > 2 && encoding.equals("UTF-8")) {
 int b0 = b4[0] & 0xFF;
 int b1 = b4[1] & 0xFF;
 int b2 = b4[2] & 0xFF;
 if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
  // ignore first three bytes...
  stream.skip(3);
 }
}
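
If you want the same check in your own code before the stream reaches the parser, it could be sketched roughly as below. This is only an illustration, not code from Xerces or the JDK: the class name BomSkipper and the method skipUtf8Bom are made-up names, and it assumes the input is a plain byte stream that may or may not start with EF BB BF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper (not part of Xerces or the JDK).
final class BomSkipper {

    private BomSkipper() {
    }

    /**
     * Returns a stream positioned after a leading UTF-8 BOM (EF BB BF).
     * If the stream does not start with a BOM, the bytes read while
     * checking are pushed back, so no content is lost.
     */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int count = 0;
        // read() may return fewer bytes than requested, so loop
        // until three bytes are available or the stream ends.
        while (count < 3) {
            int n = pushback.read(head, count, 3 - count);
            if (n < 0) {
                break;
            }
            count += n;
        }
        boolean hasBom = count == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && count > 0) {
            // Not a BOM: put the bytes back for the parser to read.
            pushback.unread(head, 0, count);
        }
        return pushback;
    }
}
```

The returned stream can then be handed to DocumentBuilder.parse(InputStream) in place of the original one. A PushbackInputStream is used here so that documents without a BOM pass through unchanged.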

Best Regards,
George
---------------------------------------------------------------------
George Cristian Bina
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com


Geert Josten wrote:

Hi all,
This is perhaps a bit off-topic, but I can't believe none of you has noticed this before. I'm using a Java 1.4.1 distribution (including Xalan 2.5.1?) and am reading an XML document with a DocumentBuilder object through the parse method. This works okay.
However, when the XML document is UTF-8 *and* includes a UTF-8 Byte Order Mark (first three bytes EF BB BF), then the parse method simply breaks with an obscure message that the document element could not be found.
Has anyone noticed this as well? If so, is there a solution?
I've written a FilterInputStream that cuts these first three bytes out, but there has got to be a nicer solution...
Thnx,
Geert

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--

