That sounds like an interesting problem. If the English->Klingon
translator left a trace of what translated to what, it might be
feasible (though difficult) to reconstruct the inline markup. Failing
that, it seems nigh impossible. But that's assuming the document uses
inline markup (which you didn't explicitly specify). If it's a matter of
just getting different sections back in place, then you'd probably make
multiple calls out to the translator, one for each blob of text. Of
course, I suppose you could try the same for inline markup. It just
might come out reading a bit funny and disconnected (but I suppose
that's to be expected from an automatic translator anyway...).
Take a trivial example document:

<p>Hello this is <strong>bold</strong>. This is <em>italic</em>.</p>
You could call the translator for each non-whitespace-only text node in
the document.
<xsl:template match="/">
<translator-inputs>
<xsl:apply-templates/>
</translator-inputs>
</xsl:template>
<!-- Ignore whitespace-only text -->
<xsl:template match="text()"/>
<xsl:template match="text()[normalize-space()]">
<to-translator>
<xsl:copy/>
</to-translator>
</xsl:template>
For the above document, that would yield:
<translator-inputs>
<to-translator>Hello this is </to-translator>
<to-translator>bold</to-translator>
<to-translator>. This is </to-translator>
<to-translator>italic</to-translator>
<to-translator>.</to-translator>
</translator-inputs>
This reveals a further requirement: strip out and reconstruct
punctuation that lies at the edges of a text blob (and that the
translator would likely ignore anyway). You could do this using regular
expressions. I'm not going to trouble myself with that right now, but
the result might look like this:
<translator-inputs>
<to-translator>Hello this is </to-translator>
<to-translator>bold</to-translator>
<to-translator sentence-boundary="yes">This is </to-translator>
<to-translator>italic</to-translator>
<to-translator sentence-boundary="yes"/>
</translator-inputs>
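If you have an XSLT 2.0 processor handy, that stripping step might
look something like the following. This is just a rough sketch (I
haven't tested it) that only handles a leading full stop; matches()
and replace() are XPath 2.0 functions, so it won't fly in 1.0:

<xsl:template match="text()[normalize-space()]">
<to-translator>
<!-- Flag a leading full stop as a sentence boundary... -->
<xsl:if test="matches(., '^\s*\.')">
<xsl:attribute name="sentence-boundary">yes</xsl:attribute>
</xsl:if>
<!-- ...and strip it before handing the text to the translator. -->
<xsl:value-of select="replace(., '^\s*\.\s*', '')"/>
</to-translator>
</xsl:template>

For the example above, ". This is " comes out as
<to-translator sentence-boundary="yes">This is </to-translator>, and
the final "." collapses to an empty element with the flag set.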
I wouldn't worry about commas so much, or even periods in the middle of
a blob of text. Theoretically, the translator will take care of those.
It's only when we chop up text near the sentence boundaries (due to
inline markup, e.g., a <b> tag) that we'd have to worry about that.
Then you'd hope to construct a result like this with help from the
translator:
<results>
<from-translator>Olleh siht si </from-translator>
<from-translator>dlob</from-translator>
<from-translator sentence-boundary="yes">Siht si </from-translator>
<from-translator>cilati</from-translator>
<from-translator sentence-boundary="yes"/>
</results>
Reconstructing the document, you'd run another transformation against
the original document, changing only the non-whitespace-only text nodes:
<!-- By default, copy everything unchanged. -->
<xsl:template match="@* | node()">
<xsl:copy>
<xsl:apply-templates select="@* | node()"/>
</xsl:copy>
</xsl:template>
<!-- But replace non-whitespace-only text nodes with their translated
counterparts. -->
<xsl:template match="text()[normalize-space()]">
<xsl:variable name="text-node-position">
<xsl:number level="any" count="text()[normalize-space()]"/>
</xsl:variable>
<xsl:variable name="result"
select="document('translation-results.xml')/results/from-translator[position() = $text-node-position]"/>
<xsl:if test="$result/@sentence-boundary = 'yes'">. </xsl:if>
<xsl:value-of select="$result"/>
</xsl:template>
I'll leave it up to you to determine whether the results would be
acceptable or not. I think it largely depends on just how much inline
markup is being used. Perhaps you'd care less about preserving bold,
italics, and other inline markup and care only about paragraph
boundaries. That would be much easier, using a similar approach to
above. In that case, a text blob would be passed to the translator for
each paragraph rather than every last text node. Either way, we can
identify each blob of text by position.
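In that paragraph-only version, the extraction template shrinks to
something like this (again a sketch -- value-of deliberately flattens
any inline markup inside the paragraph):

<xsl:template match="p">
<to-translator>
<xsl:value-of select="."/>
</to-translator>
</xsl:template>

On the way back, you'd count p elements with xsl:number instead of
text nodes and drop each translated blob into the corresponding
paragraph.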
Evan
Robert P. J. Day wrote:
it's been a while since i've written anything in XSLT so i'm going
to try to explain what a colleague is trying to do, assuming *i*
understand it.
1) start with an involved XHTML document
2) "extract" just those (english) parts that involve translatable
text, and hand it to a translator
3) translator translates english to, say, klingon
4) rebuild original document with klingon content instead of english
as i understand it, the point of the extraction is that no one wants
to burden the translator with all of the XHTML tagging -- the
translator wants to get the text stripped of all the "clutter", at
which point, after translation, someone needs to be able to put the
document back together.
is this even a reasonable thing to ask? in order to reassemble the
document, i'm assuming one is going to have to ID every single bit of
text to have a reference to build backwards.
thoughts on this? has anyone done something like this? or are you
all too busy laughing hysterically by now?
rday
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe@lists.mulberrytech.com>