On Sun, Jul 03, 2016 at 04:13:09PM -0000, Terry Badger
terry_badger(_at_)yahoo(_dot_)com scripsit:
Graydon, the document.xml files I have found and worked with, taken from
.docx files, always have a prolog with encoding="UTF-8", so I have
not worried about invalid Unicode characters and can process any text
in Word using an XSL stylesheet. Do you have a sample where a docx
file has non-Unicode encodings?
Not on hand, and if I did, it wouldn't be my data to share.
I've hit two cases of code point 0x96 -- a codepage 1252 en dash -- in an
XSLT document (which is admittedly not Word) during paid work in the
last couple of weeks, though. It does happen. It won't cause problems
until something checks for UTF-8 encoding specifically, rather than the
XML character set. It's entirely possible to have the whole XSLT
toolchain completely happy -- as it was in that case -- and something
downstream -- checking for encoding -- not happy at all. I have
certainly hit this problem with the XML versions of Office documents in
the past.
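
For illustration, a minimal Python sketch of the sort of downstream check
that trips over this. It assumes the stray character arrives as a C1
control (U+0080 to U+009F), which an XML parser accepts as a legal
character, and it guesses the intended character by reading the code
point as a codepage 1252 byte. The function name find_c1_controls is
hypothetical, not part of any toolchain mentioned here.

```python
import re

# C1 controls (U+0080..U+009F) are legal XML characters, but in practice
# they are usually codepage 1252 bytes that were never transcoded.
C1_CONTROLS = re.compile("[\u0080-\u009f]")


def find_c1_controls(text):
    """Return (offset, code point, cp1252 guess or None) for each C1 control."""
    hits = []
    for match in C1_CONTROLS.finditer(text):
        cp = ord(match.group())
        try:
            # Reinterpret the code point as a single cp1252 byte.
            guess = bytes([cp]).decode("cp1252")
        except UnicodeDecodeError:
            guess = None  # this byte has no assignment in cp1252
        hits.append((match.start(), cp, guess))
    return hits


if __name__ == "__main__":
    for offset, cp, guess in find_c1_controls("pages 10\u009620"):
        print(f"offset {offset}: U+{cp:04X}, cp1252 guess {guess!r}")
```

Run over the text content of a parsed document, this reports each
suspect character with its offset, so the fix can be made at the source
rather than discovered by whatever sits downstream.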
Pre-XML ver 5, it was possible to trust the parser to tell if your
document wasn't UTF-8 because XML's character set was a subset of UTF-8.
With ver 5, that's no longer the case.
-- Graydon