Dear XSLTians,
For a troff-to-XML/Unicode conversion I've implemented a strategy that
produces the desired result, but that does the conversion to Unicode
slowly, and I would be grateful for advice about improving the efficiency.
I handle the conversion of the structural markup first, so that all of
my XML tagging is in place, but the text strings still use troff escape
sequences rather than Unicode. The text is almost all medieval Cyrillic,
and most of the Cyrillic characters are represented in the troff with
sequences of several ASCII characters. The strategy I
adopted to convert the troff character encoding to Unicode was to create
a mapping file for the troff-to-Unicode character correspondences.
Here's a snippet (a single mapping correspondence):
<mapping>
  <troff>\(qb</troff>
  <unicode>б</unicode>
</mapping>
I then wrote an XSLT script that reads the file of mappings and
generates another XSLT script that will do the actual remapping. Here's
a snippet of the generated XSLT script; this snippet is taken from
within a template rule for text() nodes (the named template that gets
called follows the snippet):
<xsl:variable name="temp52">
  <xsl:call-template name="replacement">
    <xsl:with-param name="text">
      <xsl:value-of select="$temp51"/>
    </xsl:with-param>
    <xsl:with-param name="troff">\\\(\?s</xsl:with-param>
    <xsl:with-param name="unicode">щ</xsl:with-param>
  </xsl:call-template>
</xsl:variable>
<xsl:variable name="temp53">
  <xsl:call-template name="replacement">
    <xsl:with-param name="text">
      <xsl:value-of select="$temp52"/>
    </xsl:with-param>
    <xsl:with-param name="troff">\\\(\?c</xsl:with-param>
    <xsl:with-param name="unicode">ҁ</xsl:with-param>
  </xsl:call-template>
</xsl:variable>
. . .
<xsl:template name="replacement">
  <xsl:param name="text"/>
  <xsl:param name="troff"/>
  <xsl:param name="unicode"/>
  <xsl:value-of select="replace($text, $troff, $unicode)"/>
</xsl:template>
The program logic is that for each text node, the template rule passes
the textual contents to a replace() function that replaces a troff
encoding with the corresponding Unicode value. The replace() function is
then called again with the next mapping. The textual content is passed
along through repeated remappings, and when it emerges on the other end,
all multi-character troff sequences have been replaced with Unicode
characters. There are 64 such mappings. I use replace() only for places
where a multi-character troff string has to be replaced by a single
Unicode character; at the end of the series of calls to replace() I use
translate() to do the remaining one-to-one mappings (there are
approximately 50 of them) in a single function call. The order of the
mappings is (obviously) important; I need to remap longer strings before
shorter ones, since the shorter ones may be subcomponents of the longer
ones. In particular, I can remap individual characters (the one-to-one
mappings) only after I've taken care of all of the many-to-one ones.
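The logic described above can be condensed into a small, self-contained
XSLT 2.0 stylesheet. This is only an illustrative sketch, not the
generated script itself: it chains just two of the many-to-one mappings
(the troff sequences and Unicode values come from the snippets above),
and the two one-to-one pairs passed to translate() are hypothetical
placeholders, since the real pairs are not listed here. The identity
template is likewise an assumption about how the rest of the XML is
carried through.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed identity rule: copy elements and attributes unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
    <!-- Step 1: many-to-one mappings, one replace() per mapping.
         Each variable feeds the next, as in the generated script.
         The patterns are regular expressions, so the backslash,
         parenthesis, and question mark are escaped. -->
    <xsl:variable name="temp1" select="replace(., '\\\(\?s', 'щ')"/>
    <xsl:variable name="temp2" select="replace($temp1, '\\\(qb', 'б')"/>

    <!-- Step 2: one-to-one mappings in a single translate() call.
         The pairs a/а and b/в are HYPOTHETICAL placeholders.
         Running this before step 1 would corrupt the "b" inside
         the troff sequence \(qb, which is why it comes last. -->
    <xsl:value-of select="translate($temp2, 'ab', 'ав')"/>
  </xsl:template>

</xsl:stylesheet>
```

With these placeholder pairs, a text node containing \(qbab would come
out as бав: the replace() step consumes the whole troff sequence first,
and only the two remaining Latin letters reach translate().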
The input file (XML with troff character coding instead of the desired
Unicode) is 6.7MB and the Unicode output is 7.8MB. The transformation
takes approximately five minutes to run, which feels like an eternity,
but I'm not sure to what extent the execution time reflects the size of
the input file and the number of replacements that need to be
performed, and to what extent it reflects inefficient program design.
Can anyone suggest a revision that would provide a considerable
improvement in efficiency (bearing in mind that the XSLT script that
does the actual character remapping must be generated by XSLT from the
mappings file)?
Thanks,
David
djbpitt+xml(_at_)pitt(_dot_)edu