Re: [xsl] Applying Streaming To DITA Processing: Looking for Guidance

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 9.10.2014 16:16, Eliot Kimber ekimber(_at_)contrext(_dot_)com wrote:

Can streaming help, either with overall processing efficiency or
with memory usage?


Yes. The typical motivation for streaming is reducing memory
consumption; in your case it's very unlikely that you will gain any
performance benefit.

Where would I go today or in the near future to gain the
understanding of streaming required to answer these questions
(other than the XSLT 3 spec itself, obviously)?


Several talks and papers on streaming have been presented in past
years at both the XML Prague and Balisage conferences. For example:

https://www.youtube.com/watch?v=OeSQ4ompB1g&index=6&list=PLQpqh98e9RgXPGvJaNsE3b1Sqncz6MGvr

https://www.youtube.com/watch?v=kzGZvh-FbNw&list=PLQpqh98e9RgXPGvJaNsE3b1Sqncz6MGvr&index=7

If there is enough interest, I can try to organize a streaming workshop
or something similar as part of XML Prague 2015 (http://xmlprague.cz).

Because my data collection process is copying data to a new result,
I'm pretty sure it's inherently streamable: I'm just processing
documents in an order determined by a normal depth-first tree walk
of the map structure (a hierarchy of hyperlinks to topics) and
grabbing relevant data (e.g., division titles, figure titles, index
entries, etc.). If this was all I was doing, then for sure
streaming would help memory usage.

But because I must then process each topic again to generate the
final result, and that process is not directly streamable, would
streaming the first phase help overall?


You can split your transformation into two steps -- the first will be
streamable and the second will not. Compared to the current situation,
you will save around 50% of the memory.
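
As a rough illustration of what the streamable first step does, here is
a minimal sketch in Python rather than XSLT (chosen only so it can be
run without a streaming XSLT processor). It reads a topic as a stream
of parse events, records each title against the enclosing topic id, and
discards every topic once it has been read. The element names `topic`
and `title`, the `id` attribute, and the function name `collect_titles`
are simplifying assumptions, not your actual DITA processing code.

```python
import io
import xml.etree.ElementTree as ET


def collect_titles(source):
    """Stream one document and return (topic id, title text) pairs.

    Hypothetical helper: assumes <topic id="..."> elements that contain
    <title> elements somewhere below them.
    """
    collected = []
    id_stack = []  # ids of the <topic> elements that are currently open
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # Attributes are already available on the start event.
            if elem.tag == "topic":
                id_stack.append(elem.get("id"))
        elif elem.tag == "title" and id_stack:
            # End of a title: its text is complete, so record it now.
            text = "".join(elem.itertext()).strip()
            collected.append((id_stack[-1], text))
        elif elem.tag == "topic":
            # End of a topic: forget its id and release its subtree.
            id_stack.pop()
            elem.clear()
    return collected


if __name__ == "__main__":
    sample = b'<topic id="t1"><title>Intro</title></topic>'
    print(collect_titles(io.BytesIO(sample)))
```

The collected pairs are the only thing kept from each topic, which is
why this step does not need the topics to stay in memory.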

Taken a step further: are there implementation techniques I could
apply in order to make the second phase streamable (e.g.,
collecting the information needed to render cross references
without having to fetch the target elements) and could I expect
that to then provide enough performance improvement to justify the
implementation cost?


You can do this. You can process the "compiled grand-source document"
in streaming mode and do the lookups in a smaller document holding the
cross-referencing data in non-streaming mode.
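
A minimal sketch of that split, again in Python rather than XSLT so it
can be run anywhere: the large document is read as a stream, while the
cross-reference data sits in a small in-memory dictionary that is
consulted for every link. The element names `topic` and `xref`, the
`href="#id"` link form, and the names `render_xrefs` and `title_by_id`
are assumptions made for the illustration only.

```python
import io
import xml.etree.ElementTree as ET


def render_xrefs(compiled_source, title_by_id):
    """Stream the large document and resolve each <xref> from a dict.

    Hypothetical helper: title_by_id stands in for the small auxiliary
    cross-reference document, loaded fully into memory beforehand.
    """
    rendered = []
    for _, elem in ET.iterparse(compiled_source, events=("end",)):
        if elem.tag == "xref":
            target = elem.get("href", "").lstrip("#")
            # Lookup in the small index; the target element is never fetched.
            rendered.append(
                title_by_id.get(target, "[unresolved: %s]" % target)
            )
        elif elem.tag == "topic":
            # Finished with this topic: release its subtree.
            elem.clear()
    return rendered


if __name__ == "__main__":
    index = {"t2": "Installing"}
    doc = b'<map><topic id="t1"><xref href="#t2"/></topic></map>'
    print(render_xrefs(io.BytesIO(doc), index))
```

Because the index is built before the streaming pass starts, a link can
point forward to a topic that has not been read yet.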

The current code is both mature and relatively naive in its
implementation. Reworking it to be streamable could entail a
significant refactoring (maybe, that's part of what I'm trying to
determine).

The actual data processing cost is more or less fixed, so unless
streaming makes the XSLT operations faster, I wouldn't expect
streaming by itself to reduce processing time.


It's very unlikely that a streaming rewrite will make your code faster.
Of course, lookups in a small auxiliary cross-reference file will be
faster than in a large document, but if you already use keys today, the
difference shouldn't be very big.

However, the primary concern in this use case is memory usage:
currently, memory required is proportional to the number of topics
in the publication, whereas it could be limited to simply the
largest topic plus the size of the collected data itself (which is
obviously much smaller than the size of the topics as it includes
the minimum data needed to enable numbering and such).


I don't know how large your documentation set is, but I would be
surprised if it couldn't fit into memory (who would read it then? :-).
Streaming is generally useful when it's impossible to load the
documents into memory -- which on current machines means processing XML
files that are gigabytes in size.

                                        Jirka


- -- 
- ------------------------------------------------------------------
  Jirka Kosek      e-mail: jirka(_at_)kosek(_dot_)cz      http://xmlguru.cz
- ------------------------------------------------------------------
       Professional XML consulting and training services
  DocBook customization, custom XSLT/XSL-FO document processing
- ------------------------------------------------------------------
 OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
- ------------------------------------------------------------------
    Bringing you XML Prague conference    http://xmlprague.cz
- ------------------------------------------------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)

iEYEARECAAYFAlQ2re4ACgkQzwmSw7n0dR6shwCffITFOIsRjAVeUE+XI4c6vHmt
UEAAn1ssKI6bxGb59UYqi67McfirpoL1
=a1hq
-----END PGP SIGNATURE-----