That sounds like an interesting problem. If the English->Klingon
translator left a trace of what translated to what, it might be
feasible (though difficult) to reconstruct the inline markup. Failing
that, it seems nigh impossible. But that's assuming the document uses
inline markup (which you didn't explicitly specify). If it's a matter of
just getting different sections back in place, then you'd probably make
multiple calls out to the translator, one for each blob of text. Of
course, I suppose you could try the same for inline markup. It just
might come out reading a bit funny and disconnected (but I suppose
that's to be expected from an automatic translator anyway...).
Take a trivial example document:

<p>Hello this is <strong>bold</strong>. This is <em>italic</em>.</p>
You could call the translator for each non-whitespace-only text node in
the document.
<xsl:template match="/">
<translator-inputs>
<xsl:apply-templates/>
</translator-inputs>
</xsl:template>
<!-- Ignore whitespace-only text -->
<xsl:template match="text()"/>
<xsl:template match="text()[normalize-space()]">
<to-translator>
<xsl:copy/>
</to-translator>
</xsl:template>
For the above document, that would yield:
<translator-inputs>
<to-translator>Hello this is </to-translator>
<to-translator>bold</to-translator>
<to-translator>. This is </to-translator>
<to-translator>italic</to-translator>
<to-translator>.</to-translator>
</translator-inputs>
This reveals a further requirement: strip out and reconstruct
punctuation that lies at the edges of a text blob (and that the
translator would likely ignore anyway). You could do this using regular
expressions. I'm not going to trouble myself with that right now, but
the result might look like this:
<translator-inputs>
<to-translator>Hello this is </to-translator>
<to-translator>bold</to-translator>
<to-translator sentence-boundary="yes">This is </to-translator>
<to-translator>italic</to-translator>
<to-translator sentence-boundary="yes"/>
</translator-inputs>
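If you have an XSLT 2.0 processor handy, that stripping step might
look something like the following. This is just a rough sketch (I
haven't tested it) that only handles a leading full stop; matches()
and replace() are XPath 2.0 functions, so it won't fly in 1.0:

<xsl:template match="text()[normalize-space()]">
<to-translator>
<!-- Flag a leading full stop as a sentence boundary... -->
<xsl:if test="matches(., '^\s*\.')">
<xsl:attribute name="sentence-boundary">yes</xsl:attribute>
</xsl:if>
<!-- ...and strip it before handing the text to the translator. -->
<xsl:value-of select="replace(., '^\s*\.\s*', '')"/>
</to-translator>
</xsl:template>

For the example above, ". This is " comes out as
<to-translator sentence-boundary="yes">This is </to-translator>, and
the final "." collapses to an empty element with the flag set.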
I wouldn't worry about commas so much, or even periods in the middle of
a blob of text. Theoretically, the translator will take care of those.
It's only when we chop up text near the sentence boundaries (due to
inline markup, e.g., a <b> tag) that we'd have to worry about that.
Then you'd hope to construct a result like this with help from the
translator:
<results>
<from-translator>Olleh siht si </from-translator>
<from-translator>dlob</from-translator>
<from-translator sentence-boundary="yes">Siht si </from-translator>
<from-translator>cilati</from-translator>
<from-translator sentence-boundary="yes"/>
</results>
Reconstructing the document, you'd run another transformation against
the original document, changing only the non-whitespace-only text nodes:
<!-- By default, copy everything unchanged. -->
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<!-- But replace non-whitespace-only text nodes with their translated
counterparts. -->
<xsl:template match="text()[normalize-space()]">
<xsl:variable name="text-node-position">
<xsl:number level="any" count="text()[normalize-space()]"/>
</xsl:variable>
<xsl:variable name="result"
select="document('translation-results.xml')/results/from-translator[position() = $text-node-position]"/>
<xsl:if test="$result/@sentence-boundary = 'yes'">. </xsl:if>
<xsl:value-of select="$result"/>
</xsl:template>
I'll leave it up to you to determine whether the results would be
acceptable or not. I think it largely depends on just how much inline
markup is being used. Perhaps you'd care less about preserving bold,
italics, and other inline markup and care only about paragraph
boundaries. That would be much easier, using a similar approach to
above. In that case, a text blob would be passed to the translator for
each paragraph rather than every last text node. Either way, we can
identify each blob of text by position.
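In that paragraph-only version, the extraction template shrinks to
something like this (again a sketch -- value-of deliberately flattens
any inline markup inside the paragraph):

<xsl:template match="p">
<to-translator>
<xsl:value-of select="."/>
</to-translator>
</xsl:template>

On the way back, you'd count p elements with xsl:number instead of
text nodes and drop each translated blob into the corresponding
paragraph.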
Evan
Robert P. J. Day wrote:
it's been a while since i've written anything in XSLT so i'm going
to try to explain what a colleague is trying to do, assuming *i*
understand it.
1) start with an involved XHTML document
2) "extract" just those (english) parts that involve translatable
text, and hand it to a translator
3) translator translates english to, say, klingon
4) rebuild original document with klingon content instead of english
as i understand it, the point of the extraction is that no one wants
to burden the translator with all of the XHTML tagging -- the
translator wants to get the text stripped of all the "clutter", at
which point, after translation, someone needs to be able to put the
document back together.
is this even a reasonable thing to ask? in order to reassemble the
document, i'm assuming one is going to have to ID every single bit of
text to have a reference to build backwards.
thoughts on this? has anyone done something like this? or are you
all too busy laughing hysterically by now?
rday
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe@lists.mulberrytech.com>