xsl-list

Re: [xsl] XSLT3 - Streaming + Recursive File Output

2016-08-11 18:13:22
(A) Don't equate xsl:fork with multi-threading. In fact, the current 
implementation of xsl:fork in Saxon is not multi-threaded (xsl:result-document 
might be, but you can switch that off). Saxon's streamed processing uses a push 
model, which complicates many things, but pushing parser events to multiple 
consumers doesn't require multiple threads.
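
As a hedged illustration of "switching it off": Saxon-EE documents a saxon:asynchronous extension attribute on xsl:result-document. The fragment below is a sketch only; it assumes the attribute behaves this way in the release you are running, so check the extension documentation for your version.

```xml
<!-- Sketch only: assumes Saxon-EE's saxon:asynchronous extension attribute.
     The value "no" asks Saxon to write this result document in the calling thread. -->
<xsl:result-document href="species{position()}.xml"
                     saxon:asynchronous="no"
                     xmlns:saxon="http://saxon.sf.net/">
  <species><xsl:copy-of select="current-group()"/></species>
</xsl:result-document>
```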

(B) I think your recursive named template can be replaced with a streamable 
call on xsl:for-each-group, something like

<xsl:for-each-group select="*:species"
                    group-adjacent="(position()-1) idiv 1000">
  <xsl:result-document href="species{position()}.xml">
    <species><xsl:copy-of select="current-group()"/></species>
  </xsl:result-document>
</xsl:for-each-group>

Compared with your approach, this solution has the advantage of not imposing an 
arbitrary limit on the number of elements to be processed.
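
To make the grouping key concrete: (position()-1) idiv 1000 is 0 for species 1 to 1000, 1 for species 1001 to 2000, and so on, so each run of adjacent species sharing a key becomes one batch. A rough, untested sketch of how this might slot into a complete stylesheet follows. The file and element names are taken from your stylesheet; the template name "main" is hypothetical (invoke with -it:main), and the sketch assumes every species element is a child of UniverseKingdom/AnimalKingdom. With more than one AnimalKingdom, batches would span kingdom boundaries.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Streamable mode that copies everything except species -->
  <xsl:mode name="stream" streamable="yes" on-no-match="shallow-copy"/>
  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical entry point: run with -it:main -->
  <xsl:template name="main">
    <!-- xsl:stream is the instruction name used in this Saxon release;
         the final XSLT 3.0 specification calls it xsl:source-document -->
    <xsl:stream href="UniverseKingdom.xml">
      <xsl:fork>
        <!-- Prong 1: the document minus its species elements -->
        <xsl:sequence>
          <xsl:result-document href="output/UniverseKingdom-without-species.xml">
            <xsl:apply-templates mode="stream"/>
          </xsl:result-document>
        </xsl:sequence>
        <!-- Prong 2: species in batches of 1000, one file per batch -->
        <xsl:sequence>
          <xsl:for-each-group
              select="*:UniverseKingdom/*:AnimalKingdom/*:species"
              group-adjacent="(position() - 1) idiv 1000">
            <xsl:result-document
                href="output/AnimalKingdomSpeciesBatch{position()}.xml">
              <species><xsl:copy-of select="current-group()"/></species>
            </xsl:result-document>
          </xsl:for-each-group>
        </xsl:sequence>
      </xsl:fork>
    </xsl:stream>
  </xsl:template>

  <!-- Drop species from the shallow copy -->
  <xsl:template match="*:species" mode="stream"/>
</xsl:stylesheet>
```

Whether a given Saxon release accepts this as streamable is something to verify by running it; treat it as a starting point rather than a tested solution.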

(C) I would expect the initial unnamed mode should be streamable.

(D) In the latest XSLT 3.0 we've provided "streamable stylesheet functions" - 
not yet implemented in Saxon - but we stopped short at streamable named 
templates. But you couldn't do this kind of batching using streamable 
stylesheet functions either. A human reader can see in your code that the Nth 
recursive call of the template is always processing nodes that are later in 
document order than the (N-1)th recursive call, but it would require a 
phenomenal amount of analysis for a theorem-prover to establish that during 
static analysis, and even if you could prove it streamable, generating a 
streamable execution plan would be far from trivial.

Michael Kay
Saxonica


On 11 Aug 2016, at 23:07, Mailing Lists Mail daktapaal(_at_)gmail(_dot_)com 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

Dear All,
I have the following problem to solve using XSLT 3 streaming. I have
been trying for some time now and I hit a roadblock no matter
which way I choose. It seems an interesting problem, and solving it
will be a very good learning experience for me.

I have a HUGE XML file (an obvious candidate for XSLT 3 streaming).

I am using: SaxonEE9-7-0-7J

Problem Definition

1. Remove a set of nodes (Species) from the source
tree (UniverseKingdom.xml); there can be around 1,000,000 of them.
2. Create a file called UniverseKingdom-without-species.xml which has
every element in UniverseKingdom except the Species nodes.
3. Create batches of 1000 species and write them out to
AnimalKingdomSpeciesBatch1.xml and so on, until all the
Species are covered.

So when the program runs, I get
1. UniverseKingdom-without-species.xml and 1000 files, each with
1000 Species, with appropriate file names
AnimalKingdomSpeciesBatch1.xml ... to
AnimalKingdomSpeciesBatch1000.xml

What I did so far (after many attempts; I thought this should
work, but it did not):
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:xs="http://www.w3.org/2001/XMLSchema">
   <xsl:mode name="stream" streamable="yes" on-no-match="shallow-copy"/>
   <xsl:strip-space elements="*"/>
   <xsl:output method="xml" indent="yes"/>
   <xsl:template match="/">
       <xsl:result-document href="output\UniverseKingdom-without-species.xml">
           <xsl:stream href="UniverseKingdom.xml">
               <xsl:fork>
                   <xsl:sequence>
                       <xsl:apply-templates mode="stream"/>
                   </xsl:sequence>
                   <xsl:sequence>
                       <xsl:for-each
select="*:UniverseKingdom/*:AnimalKingdom">
                             <!-- Call Recursive Templates here -->
                           <xsl:call-template name="batch-animal-species"/>
                       </xsl:for-each>
                   </xsl:sequence>
               </xsl:fork>
           </xsl:stream>
       </xsl:result-document>
   </xsl:template>
   <xsl:template name="batch-animal-species">
       <xsl:param name="limit" select="1000000"/>
       <xsl:param name="batch" select="1"/>
       <xsl:param name="start" select="1"/>
       <xsl:param name="end" select="1000"/>
       <xsl:if test="$start &lt;= $limit ">
           <xsl:result-document
href="output\AnimalKingdomSpeciesBatch{$batch}-.xml">
               <species>
                   <xsl:for-each select="*:species[position() =
($start to $end) ]">
                       <species>
                           <xsl:copy-of select="."/>
                       </species>
                   </xsl:for-each>
               </species>
           </xsl:result-document>
           <xsl:call-template name="batch-animal-species">
               <xsl:with-param name="batch" select="$batch+1"/>
               <xsl:with-param name="start" select="$end+1"/>
               <xsl:with-param name="end" select="$end+1000"/>
           </xsl:call-template>
       </xsl:if>
   </xsl:template>
   <xsl:template match="*:species" mode="stream"/>
</xsl:stylesheet>


Here, the issue was with the template batch-animal-species. Saxon
throws this error:

e:\perf\xslt3>java -jar saxon9ee.jar str.xml splitter.xsl -o:StreamAni.xml
Static error at xsl:template on line 22 column 91 of splitter.xsl:
  XTSE3430: Template rule is declared streamable but it does not satisfy the streamability rules.
  * Operand . of CallTemplate#batch-animal-species selects streamed nodes in a context
    that allows arbitrary navigation (line 43)
Errors were reported during stylesheet compilation


I know that the logic for chunking into batched files could be made
better, or is even questionable, but I was not expecting the
call-template to fail.

I am hoping some ninja warriors of XSLT 3 can help me with this issue.
Seriously, I cannot take no for an answer :) A lot depends on this.

Also, can someone think of an intelligent way for me to get this
done with smarter code, possibly without using fork? (There is an
admin sitting somewhere in the system who has asked us to write code
without multiple threads. He wants to be responsible for the
number of threads and discourages people from spawning multiple
threads. If that is not possible, then I will insist that forking has to be
done.)
Please help ...
Dak.Tap
