[xsl] efficiency and replace()

Dear XSLTians,

For a troff-to-XML/Unicode conversion I've implemented a strategy that produces the desired result, but that does the conversion to Unicode slowly, and I would be grateful for advice about improving the efficiency.

I handle the conversion of the structural markup first, and I wind up with all of my XML tagging in place, but the text strings use troff escape sequences rather than Unicode. The text is almost all medieval Cyrillic, and most of the Cyrillic characters are represented in the troff with sequences of several ASCII characters. The strategy I adopted to convert the troff character encoding to Unicode was to create a mapping file for the troff-to-Unicode character correspondences. Here's a snippet (a single mapping correspondence):


<mapping>
  <troff>\(qb</troff>
  <unicode>б</unicode>
</mapping>

I then wrote an XSLT script that reads the file of mappings and generates another XSLT script that will do the actual remapping. Here's a snippet of the generated XSLT script; this snippet is taken from within a template rule for text() nodes (the named template that gets called follows the snippet):


<xsl:variable name="temp52">
  <xsl:call-template name="replacement">
    <xsl:with-param name="text">
      <xsl:value-of select="$temp51"/>
    </xsl:with-param>
    <xsl:with-param name="troff">\\\(\?s</xsl:with-param>
    <xsl:with-param name="unicode">щ</xsl:with-param>
  </xsl:call-template>
</xsl:variable>
<xsl:variable name="temp53">
  <xsl:call-template name="replacement">
    <xsl:with-param name="text">
      <xsl:value-of select="$temp52"/>
    </xsl:with-param>
    <xsl:with-param name="troff">\\\(\?c</xsl:with-param>
    <xsl:with-param name="unicode">ҁ</xsl:with-param>
  </xsl:call-template>
</xsl:variable>
. . .
<xsl:template name="replacement">
  <xsl:param name="text"/>
  <xsl:param name="troff"/>
  <xsl:param name="unicode"/>
  <xsl:value-of select="replace($text, $troff, $unicode)"/>
</xsl:template>
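
For readers who want to picture the generating stylesheet described above, here is a minimal sketch of one way such a generator could be written. It is not the actual script. It assumes a hypothetical <mappings> root element wrapping the <mapping> elements, assumes the file holds only the many-to-one mappings, and uses an invented "out" prefix with xsl:namespace-alias. It does not cover the final translate() step or the rules that copy elements, and it does not escape "$" or "\" in the replacement strings.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:out="http://www.w3.org/1999/XSL/TransformAlias">

  <!-- Literal out:* elements are written to the result as xsl:* elements -->
  <xsl:namespace-alias stylesheet-prefix="out" result-prefix="xsl"/>
  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: the mapping file has a <mappings> root element -->
  <xsl:template match="/mappings">
    <out:stylesheet version="2.0">
      <out:template match="text()">
        <!-- Seed the chain of variables with the original text -->
        <out:variable name="temp0" select="string(.)"/>
        <xsl:for-each select="mapping">
          <!-- Longest troff sequences first, so that a shorter sequence
               cannot match inside a longer one -->
          <xsl:sort select="string-length(troff)" data-type="number"
                    order="descending"/>
          <out:variable name="temp{position()}">
            <out:call-template name="replacement">
              <out:with-param name="text">
                <out:value-of select="$temp{position() - 1}"/>
              </out:with-param>
              <!-- Backslash-escape regex metacharacters in the troff
                   sequence, e.g. \(?s becomes \\\(\?s -->
              <out:with-param name="troff">
                <xsl:value-of
                    select="replace(troff, '([\\()?.*+\[\]{}|^$])', '\\$1')"/>
              </out:with-param>
              <out:with-param name="unicode">
                <xsl:value-of select="unicode"/>
              </out:with-param>
            </out:call-template>
          </out:variable>
        </xsl:for-each>
        <!-- Emit the result of the last replacement in the chain -->
        <out:value-of select="$temp{count(mapping)}"/>
      </out:template>

      <out:template name="replacement">
        <out:param name="text"/>
        <out:param name="troff"/>
        <out:param name="unicode"/>
        <out:value-of select="replace($text, $troff, $unicode)"/>
      </out:template>
    </out:stylesheet>
  </xsl:template>
</xsl:stylesheet>
```

Because the attribute value templates are evaluated by the generator, each generated variable is named after its position in the sorted sequence and reads from the variable generated just before it, which gives the chained shape shown in the snippet above.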

The program logic is that for each text node, the template rule passes the textual contents to a replace() function that replaces a troff encoding with the corresponding Unicode value. The replace() function is then called again with the next mapping. The textual content is passed along through repeated remappings, and when it emerges on the other end, all multi-character troff sequences have been replaced with Unicode characters. There are 64 such mappings. I use replace() only for places where a multi-character troff string has to be replaced by a single Unicode character; at the end of the series of calls to replace() I use translate() to do the remaining one-to-one mappings (there are approximately 50 of them) in a single function call.

The order of the mappings is (obviously) important; I need to remap longer strings before shorter ones, since the shorter ones may be subcomponents of the longer ones. In particular, I can remap individual characters (the one-to-one mappings) only after I've taken care of all of the many-to-one ones.
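
To make the ordering constraint concrete, here is a reduced sketch with a single replace() call followed by translate(). The \(qb mapping is the one quoted earlier in this message; the three one-to-one letter pairs in the translate() call are invented for illustration and are not taken from the real mapping table.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="text()">
    <!-- Many-to-one step first: the troff sequence \(qb becomes б.
         The backslash and the parenthesis are escaped for the regex. -->
    <xsl:variable name="temp1" select="replace(., '\\\(qb', 'б')"/>
    <!-- One-to-one step last. These letter pairs are hypothetical. -->
    <xsl:value-of select="translate($temp1, 'abv', 'абв')"/>
  </xsl:template>

</xsl:stylesheet>
```

If the two steps were swapped, translate() would turn the "b" inside \(qb into a Cyrillic letter first, and the replace() pattern would then no longer find the sequence.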

The input file (XML with troff character coding instead of the desired Unicode) is 6.7MB and the Unicode output is 7.8MB. The transformation takes approximately five minutes to run, which feels like an eternity, but I'm not sure to what extent the execution time reflects the size of the input file and the number of replacements that need to be performed, and to what extent it reflects inefficient program design. Can anyone suggest a revision that would provide a considerable improvement in efficiency (bearing in mind that the XSLT script that does the actual character remapping must be generated by XSLT from the mappings file)?


Thanks,

David
djbpitt+xml(_at_)pitt(_dot_)edu
