xsl-list

Re: [xsl] creating a collection from an archive

2018-04-19 16:09:28
Try renaming the .docx file with a .jar or .zip file extension and then using 
it directly as the collection URI - Saxon should recognize it and give you 
access to the contained files as a collection.

If that works, you could register your own CollectionFinder that subclasses the 
StandardCollectionFinder and overrides the method isJarFileURI() to recognize 
the file extension ".docx".

You can then either use the collection() function to get the set of documents 
in the ZIP file, or use uri-collection() to get their URIs, in a form that 
you can supply as arguments to the doc() function.
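
As a minimal sketch of the two options above (not tested against a real .docx): 
the archive location is a hypothetical path to a renamed copy of the file, and 
the element names in the output are made up for illustration. The exact form of 
the entry URIs is up to Saxon, and non-XML entries such as images are assumed 
to come back from collection() as something other than document nodes, hence 
the filter.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical location: a copy of localtest.docx renamed to .zip -->
  <xsl:param name="archiveURI" select="'file:///tmp/localtest.zip'"/>

  <xsl:template name="xsl:initial-template">
    <archive>
      <!-- uri-collection() returns one URI per entry in the archive -->
      <xsl:for-each select="uri-collection($archiveURI)">
        <entry uri="{.}"/>
      </xsl:for-each>
      <!-- collection() returns the resources themselves; keep only XML documents -->
      <xsl:for-each select="collection($archiveURI)[. instance of document-node()]">
        <document uri="{document-uri(.)}" root="{name(*)}"/>
      </xsl:for-each>
    </archive>
  </xsl:template>

</xsl:stylesheet>
```

If the document-uri() values come out as expected, the same URIs can be passed 
to doc() to fetch individual parts of the archive.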

You may also need to do something like 
Configuration.registerFileExtension("doc", "application/xml") so that .doc 
files are recognized as containing XML. Generally there's a lot of powerful 
machinery in Saxon for customizing the way collections are handled.

Michael Kay
Saxonica

On 19 Apr 2018, at 20:07, Graydon graydon(_at_)marost(_dot_)ca 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

So I have a Word document, localtest.docx, which is in the 2016 strict
version of the OOXML standard.  As such, it's a zip archive of a bunch
of XML files.  I want to apply XSLT to the XML files.

I could use the arch module and the collection function to write the whole
thing to disk and then load it from disk as a collection before doing whatever
to it and writing it to disk as an archive again, but this seems inefficient.
It would be better to read the archive into an in-memory collection,
manipulate it, and then write that back out as an archive.

I'm using XSLT 3.0 via Saxon 9.8.0.8 in oXygen.

<xsl:variable name="wordArchive" as="document-node()+">
  <xsl:variable name="arch" select="file:read-binary($wordArchiveURI)"/>
  <xsl:variable name="entries" select="arch:entries($arch)"/>
  <xsl:variable name="dirs" select="$entries[ends-with(.,'/')]"/>
  <xsl:sequence select="for $x in ($entries except $dirs)
                     return arch:extract-text($arch,$x) => parse-xml()" />
</xsl:variable>

works, in that I get a sequence of document nodes and those documents have the
expected XML content.

I don't get document nodes with associated document-uri() values or any of the
rest of the archive structure.  Those URIs are in the values returned by
arch:entries but I'm not seeing how I assign a document-uri value to a
document node.  xsl:document doesn't seem to have a facility for assigning a
document-uri value and of course you can't create an attribute whose parent is
a document node even if document-uri was an attribute in the first place.

What I want is a collection where the structure matches the Word archive,
various subdirectories and all, and I can use the doc() function to access
various component documents.  I can't shake the feeling that I'm missing
something obvious, but this feeling is no help in discerning what the obvious
thing is!

Thanks!
Graydon
