Re: [xsl] Nesting a flat XML structure

2018-10-29 18:02:24
Hi Gerrit,
This is excellent stuff. I have to admit that I haven't yet got to this level 
of fine detail. Maybe never will, but I'll follow your links and read them with 
interest. 
Thank you
~ Ian

-----Original Message-----
From: Imsieke, Gerrit, le-tex gerrit(_dot_)imsieke(_at_)le-tex(_dot_)de 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> 
Sent: 29 October 2018 22:13
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] Nesting a flat XML structure

Hi Ian,

We also found word processor named styles to be used too inconsistently to be 
useful for list nesting. Only in very few workflows is it possible to demand 
properly marked up lists without any indentation/numbering overrides. It is not 
uncommon that authors prefer, for example, the (a), (b), (c) listing that a 
template offers for level 2 lists over the 1., 2., 3. listing that the 
template offers for level 1 lists. Since understanding and configuring word 
processors’ list settings is not that easy, authors sometimes manually change 
the list item indentations in order to convey the desired visual appearance:
(a) This is a "List 2" item
(b) This is another "List 2" item
     1. This is actually a "List 1" item
     2. This is another "List 1" item
(c) This is another "List 2" item

Most of the time, we therefore rely on a “visual” XSLT 2.0 list nester [1] 
that uses the equivalent of CSS margin-left and text-indent properties (from 
the intermediate DocBook/CSS-based normalization format that Wendell mentioned 
[2]).
The first level has, for example, a margin-left of 18pt and a text-indent of 
-18pt. The next level has a margin-left of 36pt and a text-indent of -18pt.
We then group adjacent list items that have the same amount of margin-left + 
text-indent, with some tolerance (1.5pt) allowed.
We also take into account leading tabs and their widths. They are sometimes 
used in lieu of proper left margins in (short) list continuation paragraphs or 
in other more deeply nested items.
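
The grouping step described above can be sketched roughly as follows. This is a Python illustration only, not the XSLT from [1]; the `Para` class, its field names, and the choice to simply add the leading tab width to the indent are assumptions made for the example. Only the 1.5pt tolerance and the margin-left + text-indent sum come from the description.

```python
from dataclasses import dataclass

TOLERANCE_PT = 1.5  # tolerance quoted in the message


@dataclass
class Para:
    text: str
    margin_left: float               # pt
    text_indent: float               # pt, negative for a hanging indent
    leading_tab_width: float = 0.0   # pt; hypothetical field for leading tabs


def marker_position(p: Para) -> float:
    """Horizontal start of the first line: margin-left + text-indent,
    plus any leading tab width (an assumption about how tabs are counted)."""
    return p.margin_left + p.text_indent + p.leading_tab_width


def group_adjacent(paras):
    """Group adjacent paragraphs whose marker positions agree with the
    position of the group's first paragraph, within the tolerance."""
    groups = []
    for p in paras:
        pos = marker_position(p)
        if groups and abs(groups[-1]["pos"] - pos) <= TOLERANCE_PT:
            groups[-1]["items"].append(p)
        else:
            groups.append({"pos": pos, "items": [p]})
    return groups
```

With the 18pt/-18pt and 36pt/-18pt values from the example above, level 1 items land at position 0 and level 2 items at position 18, so a run such as (a), (b), 1., 2., (c) splits into three groups. Comparing against the first paragraph of each group, rather than the previous paragraph, keeps small deviations from accumulating.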

The list marker content, in particular the numbering, will either be calculated 
according to the complex OOXML (or the less complex IDML) rules, or we will 
take into account the literal values that the author/typesetter chose to use. 
Sometimes there is even a mix of calculated and verbose list numbers within the 
same list.
Then we try to determine a coherent list type (lower alpha, arabic, bullets, …) 
for a given nesting section. If no list type may be determined, we will turn it 
into a definition list.
The whole multi-pass XSLT process is orchestrated by an XProc pipeline [3]. It 
may be customized by importing the XSLT and supplying the customized XSLT to 
the pipeline on the stylesheet port.
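
As a rough illustration of the list type determination mentioned above, here is a minimal Python sketch. It is not taken from the transpect code; the regular expressions, the type names, and the `list_type` function are assumptions for the example, and real markers (roman numerals, multi-level numbers such as 1.2.3) need more care.

```python
import re

# Hypothetical marker patterns; order matters only for readability here.
PATTERNS = {
    "arabic": re.compile(r"^\(?\d+[.)]$"),
    "lower-alpha": re.compile(r"^\(?[a-z][.)]$"),
    "upper-alpha": re.compile(r"^\(?[A-Z][.)]$"),
    "bullet": re.compile(r"^[•\-–*]$"),
}


def marker_type(marker):
    """Return the name of the first pattern the marker matches, else None."""
    for name, pattern in PATTERNS.items():
        if pattern.match(marker):
            return name
    return None


def list_type(markers):
    """Return one coherent type for all markers of a nesting section,
    or 'definition' when the markers are mixed or unrecognized."""
    types = {marker_type(m) for m in markers}
    if len(types) == 1 and None not in types:
        return types.pop()
    return "definition"
```

For instance, `list_type(["(a)", "(b)", "(c)"])` gives `"lower-alpha"`, while a mixed run such as `["(a)", "1.", "Note:"]` falls back to `"definition"`. A marker like `i.` is ambiguous between lower alpha and lower roman; this sketch treats it as lower alpha.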

I recently estimated [4] that the heuristic visual nesting took approx. 
300 hours to implement (with some iterations), the OOXML list number 
calculation took some 240 hours, and the IDML list number calculation took ~60 
hours.

So what Graydon said is true: You can hack a docx converter that does 80% of 
the work in a week, but then you need to rely on named styles, among other 
restrictions.

Gerrit

[1] https://github.com/transpect/evolve-hub/tree/master/lists-by-indent/xsl
[2] http://archive.xmlprague.cz/2013/presentations/Conveying_Layout_Information_with_CSSa/CSSa_xmlprague_gimsieke.html
[3] https://github.com/transpect/evolve-hub/blob/master/xpl/evolve-hub_lists-by-indent.xpl
[4] https://twitter.com/letexml/status/1045224789097492480

On 29.10.2018 22:04, ian(_dot_)proudfoot(_at_)itp-x(_dot_)co(_dot_)uk wrote:
Agreed Wendell and Graydon.
I am already doing multiple passes to get the content in a suitable state to 
do the nesting part. I find that most word processed text is in a poor state 
for easy conversion to good XML that is valid to a specific schema. When 
based simply on paragraph and character style names the end result is often 
unusable. So I use temporary attributes that encode the important stylistic 
overrides - capturing what the author was trying to achieve. I have been very 
pleased with the results.

Ian

-----Original Message-----
From: Wendell Piez wapiez(_at_)wendellpiez(_dot_)com 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com>
Sent: 29 October 2018 20:17
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] Nesting a flat XML structure

Hi,

Yes, what Graydon says (multiple passes).

Here's a simple pass that wraps lists recursively based on a function that 
determines a list level for an element in a flat sequence:

https://gitlab.coko.foundation/XSweet/XSweet/blob/master/applications/list-promote/mark-lists.xsl

It can be followed by a pass to make lists for the wrappers (in this case 
HTML):

https://gitlab.coko.foundation/XSweet/XSweet/blob/master/applications/list-promote/itemize-lists.xsl

Because the wrapper is abstracted, either/both the XSLTs can be modified 
separately.
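
The two-pass idea above (mark wrappers by level, then turn the wrappers into lists) can be sketched in Python as follows. This is an illustration under assumptions, not a port of the XSweet stylesheets: the `Wrapper` class, the dict-based items, and the rule that a nested list is attached to the preceding `li` are choices made for this example. Level 0 is assumed to mean "not a list item".

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Wrapper:
    """Abstract list wrapper produced by the marking pass."""
    level: int
    children: list = field(default_factory=list)


def mark_lists(items, level_of, level=1):
    """Pass 1: wrap each run of items at `level` or deeper, recursively."""
    out, i = [], 0
    while i < len(items):
        if level_of(items[i]) < level:
            out.append(items[i])          # belongs to the enclosing level
            i += 1
        else:
            j = i
            while j < len(items) and level_of(items[j]) >= level:
                j += 1
            run = items[i:j]
            out.append(Wrapper(level, mark_lists(run, level_of, level + 1)))
            i = j
    return out


def itemize_lists(nodes, parent):
    """Pass 2: turn wrappers into <ul>, list members into <li>, the rest into <p>."""
    last_li = None
    for node in nodes:
        if isinstance(node, Wrapper):
            host = last_li if last_li is not None else parent
            itemize_lists(node.children, ET.SubElement(host, "ul"))
        else:
            el = ET.SubElement(parent, "li" if parent.tag == "ul" else "p")
            el.text = node["text"]
            last_li = el if el.tag == "li" else None
    return parent


def promote_lists(items, level_of):
    """Chain the two passes, in the spirit of the 'poor man's pipeline'."""
    return itemize_lists(mark_lists(items, level_of), ET.Element("div"))
```

Because the second pass only sees `Wrapper` objects, the level function and the output vocabulary can be changed independently, which is the point of abstracting the wrapper.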

Using XSLT 3.0 they can be chained together (poor man's pipeline) -- or of 
course you can Do It With Modes:

https://gitlab.coko.foundation/XSweet/XSweet/blob/master/applications/list-promote/PROMOTE-lists.xsl

However (as I think Graydon also implies), frequently the requirement is so 
far away from the generic, that it is easier to code it to the case.

Cheers, Wendell

On Mon, Oct 29, 2018 at 3:02 PM Graydon graydon(_at_)marost(_dot_)ca 
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

On Mon, Oct 29, 2018 at 06:52:59PM -0000, Martin Honnen 
martin(_dot_)honnen(_at_)gmx(_dot_)de scripsit:
[snipped examples]
though so keeps the "ul" lists separated from the sibling "p"
elements, have so far not understood why a list belongs into a preceding 
paragraph.

I have so far found that taking a word processor format flat sequence 
of elements and properly nesting the lists takes interpreting the 
source for level, labelling the list with that level (generally via 
disposable attribute), and then performing a distinct nesting pass 
where the final list item of a list "eats" the immediately 
following-sibling lists if they have a lower level-label than this 
list.  Especially when you have complex list items (tables, multiple 
paragraphs, notes...) it's generally just easier to approach the 
problem as a sequence of passes over the content.
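
A minimal Python sketch of such a nesting pass, assuming the earlier passes have already produced a flat sequence of lists, each labelled with a level. The data layout and the function name `nest_lists` are hypothetical, and a larger level number is assumed to mean deeper nesting. Every list is assumed to have at least one item.

```python
def nest_lists(blocks):
    """blocks: flat sequence of (level, [item_text, ...]) in document order.
    Returns top-level lists; deeper lists hang off the final item of the
    nearest preceding shallower list."""
    top, stack = [], []   # stack holds the currently open lists, shallowest first
    for level, texts in blocks:
        new = {"level": level,
               "items": [{"text": t, "lists": []} for t in texts]}
        # Close every open list that is not shallower than the new one.
        while stack and stack[-1]["level"] >= level:
            stack.pop()
        if stack:
            # The final item of the open shallower list "eats" the new list.
            stack[-1]["items"][-1]["lists"].append(new)
        else:
            top.append(new)
        stack.append(new)
    return top
```

Note that this sketch does not merge a list with a later list at the same level; after the deeper lists have been eaten, two level 1 lists can end up adjacent, and joining them would be one more pass.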

-- Graydon




--
Wendell Piez | http://www.wendellpiez.com
XML | XSLT | electronic publishing
Eat Your Vegetables
_____oo_________o_o___ooooo____ooooooo_^



--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit(_dot_)imsieke(_at_)le-tex(_dot_)de, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Geschäftsführer / Managing Directors:
Gerrit Imsieke, Svea Jelonek, Thomas Schmidt