xsl-list

Re: [xsl] CALS table model - finding 'straddle' rows

2014-02-26 06:38:01

It sounds like a positional grouping problem to me:

<xsl:template match="tbody">
  <xsl:for-each-group select="row" group-starting-with="row[entry[@nameend &gt; @namest]][1]">
    <!-- process current-group() here -->
  </xsl:for-each-group>
</xsl:template>
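
As a fuller illustration of positional grouping, here is a minimal XSLT 2.0 sketch. It is a variation on the pattern above, not the same thing: the trailing [1] is dropped so that a new group starts at every straddle row, and number() is used to make the comparison explicitly numeric, which assumes the namest/nameend values are numeric as in the sample table. The output element names (groups, group) and the category attribute are hypothetical.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="tbody">
    <groups>
      <!-- Assumption: a new group starts at every straddle row, not only the first -->
      <xsl:for-each-group select="row"
          group-starting-with="row[entry[number(@nameend) &gt; number(@namest)]]">
        <!-- $head is the straddle row leading this group; it is empty for
             rows that come before the first straddle -->
        <xsl:variable name="head"
            select="current-group()[1][entry[number(@nameend) &gt; number(@namest)]]"/>
        <group category="{normalize-space($head)}">
          <!-- Copy the data rows, leaving out the straddle row itself -->
          <xsl:copy-of select="current-group() except $head"/>
        </group>
      </xsl:for-each-group>
    </groups>
  </xsl:template>

  <!-- Suppress text outside tbody (e.g. the thead cells) -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```

Rows that precede the first straddle end up in a group whose category is the empty string, which corresponds to the "Type 1" portion of a Type 3 table.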


Michael Kay
Saxonica




On 26 Feb 2014, at 12:15, Feargal Hogan 
<feargal(_dot_)hogan(_at_)gmail(_dot_)com> wrote:


Hi all
I am building an XSLT process that goes through a large collection of XML files, 
looks for CALS model tables (most of the files contain at least one) and 
then stores the tabular data back to a database. 

In the first instance, many of these tables will have dissimilar structures, 
but I want to use the database to analyse commonality of structure. I list 
below an (incomplete) extract from an example file.
<table frame="none">
<tgroup cols="6" colsep="0" rowsep="0">
<colspec colname="1" colnum="1" colwidth="127pt" align="center"/>
<colspec colname="2" colnum="2" colwidth="39pt" align="center"/>
<colspec colname="3" colnum="3" colwidth="30pt" align="center"/>
<colspec colname="4" colnum="4" colwidth="33pt" align="center"/>
<colspec colname="5" colnum="5" colwidth="33pt" align="center"/>
<colspec colname="6" colnum="6" colwidth="87pt"/>
<thead>
   <row valign="bottom">
       <entry align="center">Product</entry>
       <entry>SKU</entry>
       <entry>Length</entry>
       <entry>Depth</entry>
       <entry align="center">Weight</entry>
       <entry align="center">Remarks</entry>
   </row>
   <row valign="bottom">
       <entry></entry>
       <entry></entry>
       <entry>(m)</entry>
       <entry>(m)</entry>
       <entry align="center">(kg) </entry>
       <entry align="center"> </entry>
   </row>
</thead>
<tbody>
   <row>
       <entry align="left" namest="1" nameend="6"><hd4>Whites</hd4></entry>
   </row>
   <row>
       <entry>Albion</entry>
       <entry>12345</entry>
       <entry>398</entry>
       <entry>15.5</entry>
       <entry> </entry>
       <entry>N/A </entry>
   </row>
   <row>
       <entry>Rotorua</entry>
       <entry>12346</entry>
       <entry>398</entry>
       <entry>15.5</entry>
       <entry> </entry>
       <entry> </entry>
   </row>
   <row>
       <entry>Quintep</entry>
       <entry>12347</entry>
       <entry>398</entry>
       <entry>15.5</entry>
       <entry> </entry>
       <entry> </entry>
   </row>
...

Because of the dissimilar structures that I know I will encounter during the 
process, I am unable to create a table schema in the database that will hold 
all this data, other than to store each table cell as an entity with the 
following properties:
      • doc_id
      • table_id
      • row_id
      • col_id
      • col_name
      • col_units
      • entry_value
      
This will allow me to store both string and numeric values as strings.
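
The cell-per-record idea could be sketched roughly as follows (XSLT 2.0, pipe-delimited text output). Several things here are assumptions rather than facts from the data: the doc-id parameter is hypothetical, col_name and col_units are taken from the first and second thead rows as in the sample table, and col_id is simply the entry's position in its row, which only holds for rows without spanning entries.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical parameter: the caller supplies an identifier for the document -->
  <xsl:param name="doc-id" select="'doc-1'"/>

  <xsl:template match="/">
    <xsl:apply-templates select="//tgroup/tbody/row/entry"/>
  </xsl:template>

  <xsl:template match="entry">
    <!-- Assumption: column = position of the entry within its row -->
    <xsl:variable name="col" select="count(preceding-sibling::entry) + 1"/>
    <xsl:variable name="head" select="ancestor::tgroup[1]/thead"/>
    <!-- doc_id | table_id | row_id | col_id | col_name | col_units | entry_value -->
    <xsl:value-of select="string-join((
        $doc-id,
        string(count(ancestor::table[1]/preceding::table) + 1),
        string(count(../preceding-sibling::row) + 1),
        string($col),
        normalize-space($head/row[1]/entry[$col]),
        normalize-space($head/row[2]/entry[$col]),
        normalize-space(.)), '|')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Every value is written as a string, so numeric and text cells are handled the same way.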

An additional property that is required in many instances is the 'category' 
as defined in the straddle row in the example above at tbody/row[1]. 

Here the original document creator has added a straddle to categorise the 
rows immediately following:
<row>
   <entry align="left" namest="1" nameend="6"><hd4>Whites</hd4></entry>
</row>
These straddle rows are causing me some difficulties. Where they occur, they 
'categorise' the rows following UNTIL the next straddle occurs.

My initial document analysis has indicated that there are a number of 
possible 'types' for the table structure in relation to these straddles:
      • Type 1. No straddles occur in the table - simple and easy to process
      • Type 2. The first row in the tbody is a straddle and there are zero or more further straddles below it in the tbody
      • Type 3. The table contains straddles but NOT in the first row of the tbody
Type 3 'could' be treated as 2 separate tables, one of type 1 (all the rows 
up to but excluding the 1st straddle row) and a Type 2 (all the rows from the 
1st straddle forwards)

So it seems that the key to solving this processing problem is to identify 
the position of the 1st straddle, treat everything (zero or more rows) before 
the straddle as Type 1 and treat everything from the straddle forward as 
Type 2.
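
The three types could be told apart with a sketch along these lines (XSLT 2.0, text output). It assumes numeric namest/nameend values, as in the sample, and simply reports one line per tbody in document order; the report format is made up for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:for-each select="//tgroup/tbody">
      <!-- First row in this tbody that contains a spanning entry, if any -->
      <xsl:variable name="first-straddle"
          select="row[entry[number(@nameend) &gt; number(@namest)]][1]"/>
      <xsl:value-of select="position()"/>
      <xsl:text>: </xsl:text>
      <xsl:choose>
        <!-- No straddle anywhere in the tbody -->
        <xsl:when test="empty($first-straddle)">Type 1</xsl:when>
        <!-- The first straddle is the first row of the tbody -->
        <xsl:when test="$first-straddle is row[1]">Type 2</xsl:when>
        <!-- Straddles exist, but ordinary rows come first -->
        <xsl:otherwise>Type 3</xsl:otherwise>
      </xsl:choose>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

For a Type 3 table, count($first-straddle/preceding-sibling::row) gives the number of rows that belong to the leading Type 1 portion.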

But I am having some difficulty identifying the position of the 1st straddle. 
My definition of 1st straddle - in XPath terms - is
tbody/row[entry [@nameend &gt; @namest]][1]
This allows for the possibility that the straddle is not always keyed from 
column 1 and does not always extend into the last column. 
Both of these possibilities do exist in the real world data.

There are many similar solutions listed on this page 
http://www.dpawson.co.uk/xsl/sect2/flatfile.html#d5010e13

But I am having difficulty applying them to my instances.

Something like this may work
<xsl:key name="straddles" match="row[entry[@nameend &gt; @namest]]" 
use="??????"/>
But I'm unsure what to use to define the @use attribute of the key?

When I try to define a first-straddle variable, I don't have a defining value 
to pass to the key() function?
<xsl:variable name="first-straddle" 
select="table/tgroup/tbody/row[generate-id() = 
generate-id(key('straddles',?????))]"/>
How do I find the location of the first straddle?
What XPath statement accurately locates it?
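
One possible way to fill in the @use attribute, sketched in XSLT 1.0: instead of indexing the straddle rows, index the ordinary rows by the generate-id() of their nearest preceding straddle row. The key name rows-by-straddle and the text output format are hypothetical, and only the first cell of each row is printed to keep the sketch short.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Each non-straddle row is keyed by the id of the nearest straddle row
       before it; rows ahead of the first straddle get the key value '' -->
  <xsl:key name="rows-by-straddle"
      match="tbody/row[not(entry[@nameend &gt; @namest])]"
      use="generate-id(preceding-sibling::row[entry[@nameend &gt; @namest]][1])"/>

  <xsl:template match="tbody">
    <!-- Visit every straddle row in document order -->
    <xsl:for-each select="row[entry[@nameend &gt; @namest]]">
      <xsl:variable name="category" select="normalize-space(.)"/>
      <!-- Fetch the rows that this straddle categorises -->
      <xsl:for-each select="key('rows-by-straddle', generate-id())">
        <xsl:value-of select="$category"/>
        <xsl:text>: </xsl:text>
        <xsl:value-of select="normalize-space(entry[1])"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>

  <!-- Suppress text outside tbody -->
  <xsl:template match="text()"/>

</xsl:stylesheet>
```

With this arrangement the first straddle itself is just row[entry[@nameend &gt; @namest]][1] evaluated from the tbody, and no key lookup is needed to find it.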
Thanks in advance
Feargal
--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


