xsl-list

Re: [xsl] where to look for xsl folk..

2016-07-03 11:42:58
Hi,

I also ran a series of tests because I was particularly focused on avoiding
character loss. The tests looked good, but if you have any cases where you know
character loss happens, I'd be very interested to learn more ...

adam

On July 3, 2016 9:13:02 AM PDT, "Terry Badger 
terry_badger(_at_)yahoo(_dot_)com" 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:
Graydon,
The document.xml files I have found and worked with, taken from .docx files,
always have a prolog that has encoding="UTF-8", so I have not worried
about invalid Unicode characters and can process any text in Word using
an xsl stylesheet.
Do you have a sample where a docx file has non-Unicode encodings?
Word does have some difficult structures, but nothing impossible with
xsl so far.
Terry



On Sunday, July 3, 2016 11:14 AM, "Graydon graydon(_at_)marost(_dot_)ca"
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:



On Tue, Jun 21, 2016 at 03:42:05AM -0000, adam adam@coko.foundation
scripsit:
Rather, I am looking to convert docx to HTML with xsl. No magic involved.
Good enough HTML is good enough. I was looking for someone to help me
build this as well-structured stylesheets that can be extended later.

The really tough problem here is not "did I get good enough HTML?"; it's
"did any important bits of the text get lost during conversion?"  That
one's brutal.
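
A rough way to test for lost text, sketched here in Python rather than XSLT: pull the character data out of word/document.xml, pull the character data out of the generated HTML, and compare character counts with whitespace ignored. The function names (docx_text, html_text, lost_characters) are made up for this illustration, and the sketch assumes the body text sits in w:t elements of the main document part. It does not look at footnotes, end notes, headers, or text boxes, which live in other parts of the package, so a clean result here is only a partial check.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter
from html.parser import HTMLParser

# WordprocessingML main namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(docx_bytes):
    """Concatenate the content of every w:t element in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    return "".join(t.text or "" for t in root.iter(f"{{{W_NS}}}t"))


class _TextCollector(HTMLParser):
    """Collect character data, with entities already decoded."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_text(html):
    """Return the character data of an HTML string (hypothetical helper)."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def _normalize(s):
    # Whitespace is ignored because the two formats wrap and space text differently.
    return re.sub(r"\s+", "", s)


def lost_characters(docx_bytes, html):
    """Count characters present in the docx body but missing from the HTML."""
    source = Counter(_normalize(docx_text(docx_bytes)))
    output = Counter(_normalize(html_text(html)))
    return source - output  # Counter subtraction keeps positive counts only
```

An empty Counter means every non-whitespace character in the main body shows up at least as often in the HTML. It says nothing about order or placement, so it catches dropped characters but not scrambled ones.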

The sanity-preserving way to do this is to use Libre Office to convert
the docx to Open Document and to go from Open Document XML. The Libre
Office "Save as HTML" facility is likely better than anything you can
write in reasonable time; I'd be looking to take that HTML and tidy it
to meet specific project requirements with XSLT.  (There are API hooks
for doing this in both OpenOffice and LibreOffice.  There are hooks for
applying XSLT as part of that process, too.)

I can't tell you what you want to do, but I desperately do not want to
address docx with XSLT directly, because then I, and not someone else,
will be trying to handle the encoding issues (since XML
I-think-version-five, the awkward cp1252 characters like 0x97 (em-dash) or
the smart quotes are legal XML characters, but they're not Unicode
anything; parsing won't find them for you anymore), the specific
peculiarities of an undocumented format intended (for sound commercial
reasons) to be nigh-impossible to convert to other formats, or the
various "it did what with the end notes? It displays end notes, where
are they in the file?" problems you can hit with academic writing.
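
One possible reading of the encoding concern above: a cp1252 byte such as 0x97 can end up in XML as the code point U+0097, which is a C1 control character rather than a dash. The XML 1.0 character range allows it, so the parser accepts it silently. A minimal Python sketch for flagging such characters in already-parsed text follows. The function name find_c1_suspects is hypothetical, and the cp1252 lookup is only a guess at what the author of the text meant, not a statement about what Word actually writes.

```python
def find_c1_suspects(text):
    """Return (offset, code point, probable cp1252 character) for each C1 control.

    Code points U+0080..U+009F are rarely intended in prose; they often
    indicate cp1252 bytes that were treated as code points directly.
    """
    hits = []
    for offset, ch in enumerate(text):
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                # Reinterpret the code point as a single cp1252 byte.
                guess = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                guess = None  # a few byte values are undefined in cp1252
            hits.append((offset, f"U+{ord(ch):04X}", guess))
    return hits
```

Running this over the text extracted from a converted document gives a list of positions to inspect by hand. An empty list means no C1 controls were found, which is what Terry's experience with UTF-8 document.xml files would suggest is the usual case.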

-- Graydon


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
--~----------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
--~--