Re: [xsl] Generic stylesheet to flatten XML hierarchy

2009-12-07 13:49:28
I know that this may not work in every case. Basically the rules are: 

* every attribute on an element becomes a column in a row
* every element that has data content becomes a column in a row
* repeating elements define a row -- with the further restriction that if 
repeating elements are nested, the lowest level of repeating elements defines 
the row and values from the ancestor levels are repeated in each row
* hierarchical relationships get flattened
* siblings at any level that don't repeat get repeated in each row
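
The rules above can be sketched procedurally. What follows is a minimal Python illustration, not an XSLT solution, and it rests on assumptions that the thread does not state: an element "repeats" when another child of the same parent has the same name; attribute columns are named element-attribute (by analogy with rss-attr1); text columns take the element name; a later value overwrites an earlier one when names clash; mixed content and namespaces are ignored. The names own_columns, collect, and flatten and the sample input are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def own_columns(elem):
    """Columns contributed by one element: its attributes and its text."""
    # Assumption: attribute columns are named "<element>-<attribute>".
    cols = {f"{elem.tag}-{name}": value for name, value in elem.attrib.items()}
    text = (elem.text or "").strip()
    if text:
        # Assumption: a text column is named after the element itself.
        cols[elem.tag] = text
    return cols


def collect(elem, ctx, repeats):
    """Fold elem and its non-repeating descendants into ctx.

    Repeating children are not descended into; they are appended to
    `repeats` so the caller can start one row (or row group) per child.
    """
    ctx.update(own_columns(elem))
    counts = Counter(child.tag for child in elem)
    for child in elem:
        if counts[child.tag] > 1:
            repeats.append(child)
        else:
            collect(child, ctx, repeats)


def flatten(root):
    """Return a list of dicts, one per lowest-level repeating element."""
    rows = []

    def walk(elem, inherited):
        ctx = dict(inherited)          # ancestor columns are repeated per row
        repeats = []
        collect(elem, ctx, repeats)
        if repeats:
            for rep in repeats:
                walk(rep, ctx)
        else:
            rows.append(ctx)           # nothing repeats below: emit a row

    walk(root, {})
    return rows


# Hypothetical input, loosely shaped like the rss example in this thread.
sample = ET.fromstring(
    '<rss version="2.0"><channel><title>News</title>'
    '<item id="1"><name>A</name></item>'
    '<item id="2"><name>B</name></item>'
    '<link>http://example.org/</link></channel></rss>'
)
for row in flatten(sample):
    print(row)
```

Because collect scans all children of a level before any row is started, a non-repeating sibling that follows the repeating elements in document order (link here) still appears in every row. Two different repeating names under one parent each produce their own rows in this sketch; whether that is the desired behaviour is one of the open questions.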

I think I'm going to try one last possible solution using keys and XPath; if 
that does not work, I may move on to Michael Kay's suggestion of a 
meta-stylesheet. 

Thanks to everyone for the ideas.

--- On Fri, 12/4/09, C. M. Sperberg-McQueen 
<cmsmcq(_at_)blackmesatech(_dot_)com> wrote:

From: C. M. Sperberg-McQueen <cmsmcq(_at_)blackmesatech(_dot_)com>
Subject: Re: [xsl] Generic stylesheet to flatten XML hierarchy
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Cc: "C. M. Sperberg-McQueen" <cmsmcq(_at_)blackmesatech(_dot_)com>
Date: Friday, December 4, 2009, 6:35 PM
On 4 Dec 2009, at 12:37, Sara Mitchell wrote:

...

With input like this:
<rss ...some attributes>
   ...
</rss>

I would like XML output like this:

<root>
<row>
  <rss-attr1>value</rss-attr1>
...
</row>
<row>...again rss attributes, channel
attributes, non-repeating children of channel followed by
fields for second item </row>
...more rows ...
</root>

I'm having trouble seeing exactly what should be going on here,
because I can't see anything in your sample input (elided here
without loss of generality) that gives rise to the name
'rss-attr1'.  It's hard to correlate input with output if
all the values are spelled 'value' and some details in one
half of the input / output pair correspond to ellipses in the
other.




This example is for a single level of repeating
descendants, but my solution has to be able to handle any
level of repeating descendants. Moreover, the stylesheet
has no knowledge of the structure of the input document.

My very strong gut reaction here is to suspect that such an
absolutely generic transformation is unlikely to produce helpful
(or: meaningful) output in some unknown but possibly large
percentage of cases.

Perhaps the transformation you have in mind is intended to
work generically on all XML documents that follow certain
conventions in structuring the information they represent?
Can you say what those conventions are?

Perhaps you have a very clear understanding of the transform you
want, but so far this discussion has not elicited a clear
description from you.  The following questions are intended to
try to elicit some more clarity.

In a generic XML document, there are elements with parents,
left and right siblings, children, descendants, and attributes.

In a generic table, there are rows and columns.  Each row but
the first or last has a predecessor and a successor, and ditto
each column but the first or last.

What is the relationship between the elements, attributes,
containment and sibling relations in the input, and the
rows and columns and their sequence relations in the
output?

Given your output table, should I expect to have all the
information present in the XML?  Can I recreate the XML from
your table?

Do all your rows have the same number of columns?  (I suppose
they must, or it's not much of a table, but perhaps I'd
better check?)

When does an XML document give rise to a single row in the
output table?  When does it give rise to exactly three rows?
When does the resulting table have exactly one column?

What information do the labels of columns convey?

What tables would you want to produce for the documents

(1) <e/>
(2) <e><e n="23"/><e n="45">Pax</e></e>
(3) <table>
    <row a="1" b="2" c="34">998</row>
    <row a="2" b="22" c="34">999</row>
    <row a="3" b="2" c="3">1000</row>
    <row a="4" b="24" c="">1001</row>
    <row a="5" x="Viva Villa!" c="34">998</row>
    </table>
(4) <p>This isn't mixed content, because the schema says I'm a string.</p>

?
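
Document (3) bears directly on the column-count question: its rows do not all carry the same attributes. As a rough illustration only (Python rather than XSLT, with the hypothetical helper name ragged_table, and assuming that columns are the union of attribute names in first-seen order), one way to see the problem is:

```python
import xml.etree.ElementTree as ET

# Document (3) from the list of test cases above.
doc3 = ET.fromstring(
    '<table>'
    '<row a="1" b="2" c="34">998</row>'
    '<row a="2" b="22" c="34">999</row>'
    '<row a="3" b="2" c="3">1000</row>'
    '<row a="4" b="24" c="">1001</row>'
    '<row a="5" x="Viva Villa!" c="34">998</row>'
    '</table>'
)


def ragged_table(parent):
    """Build a header from the union of attribute names, then one list per child.

    Assumption: a missing attribute becomes None, which keeps it distinct
    from an attribute that is present but empty (c="" in the fourth row).
    """
    header = []
    for row in parent:
        for name in row.attrib:
            if name not in header:
                header.append(name)
    body = [[row.get(name) for name in header] for row in parent]
    return header, body


header, body = ragged_table(doc3)
print(header)
for cells in body:
    print(cells)
```

Under these assumptions the table has four columns, the last row has no value for b, and only the last row has a value for x. A generic stylesheet has to decide what to emit in those cells.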



I have a solution that works ok by traversing the
input document in doc order -- but it does not handle the
siblings of repeating nodes that are not themselves
repeating.

I have thought of doing this the opposite way, get a
key of all repeating nodes and process only those at the
lowest depth to generate rows.  I haven't actually
written the logic.
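
The second idea (find the repeating nodes, then keep only the lowest ones) can be sketched as follows. This is Python, not the key-based XSLT the poster has in mind, and it assumes one particular reading: a node "repeats" if a sibling has the same name, and "lowest" means it has no repeating descendants. The function name lowest_repeating and the sample input are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def lowest_repeating(root):
    """Return repeating elements that contain no repeating descendants."""
    result = []

    def visit(elem, is_repeating):
        # Returns True if elem, or anything below it, is a repeating element.
        counts = Counter(child.tag for child in elem)
        found_below = False
        for child in elem:
            if visit(child, counts[child.tag] > 1):
                found_below = True
        if is_repeating and not found_below:
            result.append(elem)        # a row-defining element
        return is_repeating or found_below

    visit(root, False)
    return result


# Hypothetical input: the first b has repeating c children, the second has none.
doc = ET.fromstring('<a><b id="1"><c/><c/></b><b id="2"><d/></b></a>')
print([e.tag for e in lowest_repeating(doc)])
```

Here the first b is skipped because its c children repeat, while the second b defines a row itself. Each selected element would still need the columns of its ancestors and of their non-repeating children to form a complete row.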

I gather that the tables you want to generate have something
to do with multiple occurrences of elements with the same name.
Does adjacency matter, or would


<a><b/><b/><b/><c/><c/><c/></a>

be treated differently from


<a><b/><c/><b/><c/><b/><c/></a>

?  (Assume if you like, for purposes of discussion, that the b and c
and a elements all have interesting attributes.)
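
The difference between the two readings can be made concrete with a small sketch. It is written in Python for illustration (the helper names counts_by_name and adjacent_runs are hypothetical) and compares counting children by name with counting runs of adjacent same-named children:

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import groupby

grouped = ET.fromstring("<a><b/><b/><b/><c/><c/><c/></a>")
interleaved = ET.fromstring("<a><b/><c/><b/><c/><b/><c/></a>")


def counts_by_name(parent):
    """Occurrences of each child name, ignoring position."""
    return Counter(child.tag for child in parent)


def adjacent_runs(parent):
    """(name, length) for each run of adjacent children with the same name."""
    return [(tag, len(list(run))) for tag, run in groupby(c.tag for c in parent)]


print(counts_by_name(grouped) == counts_by_name(interleaved))
print(adjacent_runs(grouped))
print(adjacent_runs(interleaved))
```

If only names matter, the two documents are indistinguishable (three b and three c each). If adjacency matters, the first has two runs of three and the second has six runs of one, so under that reading nothing in it repeats at all.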


Any better ideas would be welcome.

Your example reminds me of the contortions I've seen people
go to trying to represent structured information in RFC 822
attribute-value pairs.  So the best idea I have at the moment
is:  Save yourself!  Don't do it!

But probably you know exactly what you're doing, there is a
perfectly reasonable algorithm for what you want, and I just
haven't understood.

hth

--****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com
* http://cmsmcq.com/mib
* http://balisage.net
****************************************************************





--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--





