Re: [xsl] BIDI problem in XSL-FO

Hi Geert and Ken,

Thanks a lot for the reminder to look for some context.

We are using Antenna House, so a great deal is handled correctly out of 
the box. As we are doing automated publishing, there is no good way to add 
markup later, just for publishing reasons.

But we happen to have an element <nt> available which is used to tag 
non-translatable content. In my shortened example

<fo:block>Brand name (Former name)</fo:block>

this element was used to tag both brand names in the source, similar to this:

<p><nt>Brand name</nt> (<nt>Former name</nt>)</p>

If I now use <fo:bidi-override direction="ltr"> for all those <nt>, i.e. 
excluding the parentheses, I get it rendered like this:

(Former name) Brand name

This - as far as I am concerned - makes a lot of sense, as the general reading 
direction is from right to left, and this way the less important information in 
parentheses comes 'after' the main information. A fun detail: both 
parentheses are now mirrored glyphs.
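
The <nt> mapping described above could be sketched as the following XSLT.
This is only an illustration: the mapping of <p> to a bare fo:block is an
assumption, and a real stylesheet will carry many more properties.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- Assumption: a paragraph becomes a plain block. -->
  <xsl:template match="p">
    <fo:block>
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

  <!-- Wrap only the non-translatable content in an LTR override.
       The parentheses are text children of <p>, so they stay outside
       the override and keep the paragraph direction. -->
  <xsl:template match="nt">
    <fo:bidi-override direction="ltr">
      <xsl:apply-templates/>
    </fo:bidi-override>
  </xsl:template>

</xsl:stylesheet>
```

Applied to <p><nt>Brand name</nt> (<nt>Former name</nt>)</p>, this should
produce a block with two separate overrides and the " (" and ")" between
and after them left untouched.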

I have to wait for some feedback from my customer’s proofreaders.

Thanks for the opportunity to discuss this.

- Michael

On 29.04.2016 at 23:42, Geert Bormans 
geert(_at_)gbormans(_dot_)telenet(_dot_)be 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

Hi Michael,

It is late on a Friday here, so I'll keep my post very brief
(an excuse for being not exactly brief but rather unstructured :-)
Over the past couple of months I have been tackling quite a few issues 
similar to what you describe.
The visual rendering "(Brand name (Former name" is often the correct 
behaviour, but I learned from Arabic proofreaders that it is still not
pleasing to them.

The brackets around English text are one infamous, well-known issue.
Another one I had trouble with is registered trademark signs appearing
randomly before or after an English word.
And keeping 1-25 as the page number in the footer for chapter-page numbering, 
instead of 25-1, has been hard too.

The algorithms for switching between RTL and LTR are complex, and although
there is very good support in some FO processors,
behaviour is not always predictable.

We learned there is an advantage in creating inner context to tune the 
algorithms our way.
Our documents are DITA, so I pulled the DITA files out of the CMS and added a 
<term> element context around English text in Arabic,
but ONLY where there are potential issues (cases are rather isolated, so if 
there are no brackets, for example, just leave the English text as it is in
order not to make mistakes).
We had cases like this:
"A arrow B"
with no Arabic characters in there.
In Arabic the result needed to show
"B flipped arrow A"
That will not happen correctly if you make your bidi override too large 
(your suggested regex would break this example).

Learned the hard way, advice no. 1: be conservative in creating bidi 
overrides, because most often the FO processor does the right thing.
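
One way to keep overrides this narrow is to make them opt-in in the
stylesheet. The sketch below is an assumption, not what was actually used
here: it matches DITA <term> through its class attribute and only reacts
to a hypothetical outputclass value "bidi-ltr" that an author would set
on the few problem cases.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- Only <term> elements explicitly flagged by the author get an
       override. The outputclass value 'bidi-ltr' is hypothetical.
       Assumes the DITA class attribute has been populated, as it is
       after DTD-aware parsing. -->
  <xsl:template
      match="*[contains(@class, ' topic/term ')][@outputclass = 'bidi-ltr']">
    <fo:bidi-override direction="ltr">
      <xsl:apply-templates/>
    </fo:bidi-override>
  </xsl:template>

</xsl:stylesheet>
```

Unflagged terms and all other English text fall through to the normal
templates, so the formatter's own bidi handling stays in charge of
everything that was not marked.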

I am revising my regular expressions over the next week or two because of 
toolset version changes

From my experience you are doing the right thing using bidi override (happy 
to learn otherwise from this thread).
I am confident that, depending on the tools you use AND the proofreaders 
(opinions differ), you should experiment to find your own best-matching regex.

So advice no. 2: test different versions of your toolset and test them well.
When I first started working on Arabic manuals with lots of English terms in 
them, Antenna House was my best option and already did a very good job.
To fix the issues left in the manuals we added a bidi override context 
(<term> element).
In the 6.3 release and the 6.3 Maintenance Release 1, Antenna House greatly 
improved the handling of bidi overrides.
We soon realised that Antenna House had fixed some of our issues for us, so we 
are in the process of undoing some of our context fixes.

Hope this helps at least a little
Depending on the popularity of this topic, happy to discuss more details of 
findings on this forum or outside of it

Best regards

Geert

----- Original message -----
From: "Michael Müller-Hillebrand mmh(_at_)docufy(_dot_)de" 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com>
To: "xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com" 
<XSL-List(_at_)lists(_dot_)mulberrytech(_dot_)com>
Sent: Friday, 29 April 2016 20:05:07
Subject: [xsl] BIDI problem in XSL-FO

Dear experts,

The processing done by an FO formatter for right-to-left (RTL) languages is 
nearly magic, considering what happens if you just set

writing-mode="rl-tb"

I am really enjoying my first project with Arabic text. Interestingly, the
problem at hand is English words. In the glossary of an RTL document I
suddenly have a paragraph consisting entirely of Latin characters:

<fo:block>Brand name (Former name)</fo:block>

This is visually rendered like this:

(Brand name (Former name
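
A minimal FO document along the following lines should be enough to
reproduce this. Where writing-mode is set is not stated above; putting it
on fo:page-sequence (allowed in XSL 1.1) and the master name "page" are
assumptions made for this sketch.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <!-- Hypothetical A4 page master, just enough to get output. -->
    <fo:simple-page-master master-name="page"
        page-width="210mm" page-height="297mm">
      <fo:region-body/>
    </fo:simple-page-master>
  </fo:layout-master-set>

  <!-- writing-mode="rl-tb" makes right-to-left the base direction. -->
  <fo:page-sequence master-reference="page" writing-mode="rl-tb">
    <fo:flow flow-name="xsl-region-body">
      <!-- Latin-only paragraph inside an RTL context. -->
      <fo:block>Brand name (Former name)</fo:block>
    </fo:flow>
  </fo:page-sequence>
</fo:root>
```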

I have looked at

* Unicode BIDI Processing <http://www.w3.org/TR/xsl/#d0e4879>
* Unicode BIDI algorithm <http://www.unicode.org/reports/tr9/>

I now understand that there are strong and weak characters. The sequence of 
strong Latin characters with embedded 'weak' spacing and punctuation is 
rendered LTR; the closing 'weak' parenthesis has no strong character after
it, so it is treated as RTL, because this is the default orientation of the
paragraph.

My first idea is to add <fo:bidi-override direction="ltr"> to each block or 
maybe only each text node that consists solely of non-Arabic characters. I 
guess this could be done using a regular expression like

not(matches($text, '\p{IsArabic}'))
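
As a template rule, that idea might look like the sketch below. It uses
the block escape \p{IsArabic}, because XPath 2.0 regular expressions
follow XML Schema syntax, where Unicode blocks take the "Is" prefix. That
block only covers U+0600 to U+06FF, so other Arabic blocks such as the
presentation forms would need to be added. Note that Geert's reply above
cautions that such a broad match can override too much.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format">

  <!-- Match non-empty text nodes that contain no character from the
       Arabic block and force them to be laid out left to right.
       Whitespace-only nodes are skipped by the first predicate. -->
  <xsl:template
      match="text()[normalize-space()][not(matches(., '\p{IsArabic}'))]">
    <fo:bidi-override direction="ltr">
      <xsl:value-of select="."/>
    </fo:bidi-override>
  </xsl:template>

</xsl:stylesheet>
```

This requires an XSLT 2.0 processor, since matches() is not available in
XSLT 1.0.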

Do you have any other recommendations or best practices?

Thanks,

- Michael
