RE: Incremental transformations with Xalan and performance issues?

Michael:

Thanks for the response. BTW, I use your XSLT book as my primary 
reference...nice work!

> You might find it better to ask such questions on the xsl-list at
> mulberrytech.com, or if you're really interested only in Xalan, on a
> Xalan-specific forum.


Like many, I suffer from YAL (Yet Another List) syndrome and am hesitant to
subscribe to any more lists, given how much stuff I already receive. I knew
some XSLT heavyweights (like yourself) hang out here, hence the decision to
post to the xml-dev group. However, I've now also cross-posted to the xsl
group.

I also think that as XML adoption continues to accelerate, transformations
of extremely large documents using XSLT will become more and more of a
general concern to the community.

> In general, every mainstream XSLT processor today builds a tree
> representation of the input document in memory. I believe Xalan does parsing
> and transformation in parallel, but it still builds the tree. The fact that
> the parser and the transformer communicate using SAX is irrelevant - it just
> means that the transformer and not the parser is building the tree. (This
> isn't totally irrelevant, because the transformer can build a much more
> efficient tree knowing it is read-only. But it's still an in-memory tree.)


In that case, I might have to redesign how we handle our XML and keep each
mail merge recipient entry in a separate document, rather than have the
whole thing as one monolithic document.

Do you happen to know if anyone has tried to build an XSLT engine that does
incremental transformations on incoming SAX events, without requiring the
building of a tree? Where the transform is suited to it, that kind of
approach would be much more efficient in memory usage and would, I should
think, allow transforms of documents of virtually unlimited size.
Something to investigate...

> I can't speak for Xalan, but Saxon users are running transformations up to
> 200Mb or so without too much trouble, and at speeds up to 10Mb/sec. It
> requires a little care in configuring the memory allocation, and in writing
> the stylesheet to avoid non-linear constructs, but it's certainly doable.
> Beyond that, it probably gets difficult.


I'm using Xalan (inside Cocoon), and for this task have not yet figured out
a way to use Saxon due to some extensions I'm using. More specifically, I
need to get/put stuff into the session, and I'm using something like this
(in Xalan):

<xalan:component prefix="javaSession">
        <xalan:script lang="javaclass"
                src="xalan://org.apache.cocoon.environment.Session"/>
</xalan:component>

Then have templates like:

<xsl:template name="javaCall:setSessionAttribute">
        <xsl:param name="attributeName" select="'unknown'" />
        <xsl:param name="attributeValue"/>
        <xsl:param name="session"/>
                
        <xsl:variable name="dummy"
                select="javaSession:setAttribute($session, $attributeName, $attributeValue)"/>
</xsl:template>
        
<xsl:template name="javaCall:getSessionAttribute">
        <xsl:param name="attributeName" select="'unknown'" />
        <xsl:param name="session"/>
                
        <xsl:copy-of select="javaSession:getAttribute($session, $attributeName)"/>
</xsl:template>

The session parameter is a reference to the user's session that is passed in
from the calling stylesheet with a bit of magic from a custom Cocoon
transformer class.

This works fine with Xalan: if you save a tree fragment and then retrieve
it, you end up with a node list/tree fragment as desired. With Saxon,
however, if I instead use the Saxon component definition:

<saxon:script language="java"
        implements-prefix="javaSession"
        src="java:org.apache.cocoon.environment.Session"/>

I can save a result fragment, but when I retrieve it, I don't get a node
list/tree fragment. I haven't figured out how to correct this with Saxon
yet.

If it weren't for this, I could freely switch between the two XSLT engines
with a build parameter.

> You don't actually say what you mean
> by a "large document". (Personally, I am amazed to see people handling a 200Mb
> database as a single in-memory document, but perhaps I'm just old-fashioned).


I'm not sure yet...the client has not given me any indication of how big
the mail merge might be. 1M letters would hit the database limit of 2GB for
the XML document in the table column (CLOB). 100K letters would hit the
200MB level that you mentioned. (Both figures assume roughly 2KB of XML per
letter.)

I'd rather implement a solution that has no limitations, so given the lack
of a true "incremental/SAX" based transformer implementation, I'm thinking
that I'll need to move away from the monolithic document approach and store
each recipient's info in a separate small document to work around the
current XSLT document size limitations.
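
A possible middle ground, as a rough sketch only: keep the monolithic
document, but have a SAX handler hand each recipient subtree to its own
small transformation, so that only one recipient's tree is in memory at a
time. This is not a true incremental XSLT engine. It uses only the standard
JAXP classes (SAXTransformerFactory, TransformerHandler), so it should not
depend on the processor. The class name PerRecordTransformer, the element
name "recipient", and the tiny stylesheet are all made up for illustration,
and namespace prefix mappings are not forwarded.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: runs one small transformation per record element. */
public class PerRecordTransformer extends DefaultHandler {
    private final SAXTransformerFactory factory;
    private final Templates templates;   // stylesheet compiled once, reused per record
    private final String recordName;     // assumed record element name, e.g. "recipient"
    // Results are collected in a list only to keep the demo small; a real
    // mail merge would write each result out and discard it.
    private final List<String> results = new ArrayList<>();

    private TransformerHandler current;  // non-null only while inside a record
    private StringWriter buffer;
    private int depth;

    public PerRecordTransformer(String stylesheet, String recordName)
            throws TransformerConfigurationException {
        this.factory = (SAXTransformerFactory) TransformerFactory.newInstance();
        this.templates = factory.newTemplates(new StreamSource(new StringReader(stylesheet)));
        this.recordName = recordName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (current == null && recordName.equals(localName)) {
            // Start a fresh one-record "document" for the transformer.
            try {
                current = factory.newTransformerHandler(templates);
            } catch (TransformerConfigurationException e) {
                throw new SAXException(e);
            }
            buffer = new StringWriter();
            current.setResult(new StreamResult(buffer));
            current.startDocument();
            depth = 0;
        }
        if (current != null) {
            current.startElement(uri, localName, qName, atts);
            depth++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (current != null) current.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (current == null) return;     // outside any record: ignore
        current.endElement(uri, localName, qName);
        if (--depth == 0) {
            current.endDocument();       // closes the one-record document
            results.add(buffer.toString());
            current = null;              // this record's tree can now be collected
        }
    }

    public static List<String> transformEach(String xml, String stylesheet, String recordName)
            throws Exception {
        PerRecordTransformer handler = new PerRecordTransformer(stylesheet, recordName);
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);     // so localName is populated
        spf.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.results;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<merge><recipient><name>Ann</name></recipient>"
                   + "<recipient><name>Bob</name></recipient></merge>";
        String xsl = "<xsl:stylesheet version='1.0'"
                   + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
                   + "<xsl:output method='text'/>"
                   + "<xsl:template match='recipient'>Dear <xsl:value-of select='name'/>"
                   + "</xsl:template></xsl:stylesheet>";
        for (String letter : transformEach(xml, xsl, "recipient")) {
            System.out.println(letter);
        }
    }
}
```

The obvious catch is that each recipient is transformed in isolation, so the
stylesheet cannot look at anything outside the current recipient element
(no shared header data, no cross-recipient XPath).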

> If you really need purely serial processing, you might consider STX as an
> alternative. However, the existing STX implementations are far less
> widely-used or mature than the popular XSLT implementations.


That's not an option in our case, since we rely on XSLT so much.


Andrzej Jan Taramina
Chaeron Corporation: Enterprise System Solutions
http://www.chaeron.com

