[xsl] Extraction of data using key() and matches()

2010-06-05 14:03:07
Hello,

I have a large number of XML data files (converted from Excel), each
containing a table of rows and data cells.

I want to find out whether the table's data cells contain any country
names and, if so, record in another file every country name that
appears in the data file. A country name may be the only content of a
data cell (<col>United Kingdom</col>), or it may be surrounded by other
text (<col>Data has been provided for United Kingdom only.</col>). A
single cell may also contain more than one country name. The cells
hold no other elements, just character data.

My current approach is to have an exhaustive lookup file with *all*
country names that are potentially used. For each XML data file, I
loop over all country names and test whether any data cell in that
file matches the current country name.

The following works but is rather slow:

countries.xml

<countries>
  <country code="ABW">
    <fr>Aruba</fr>
    <en>Aruba</en>
  </country>
  <country code="AFG">
    <fr>Afghanistan</fr>
    <en>Afghanistan</en>
  </country>
  ...
</countries>

data.xml

<workbook>
  <sheet>
    <name><![CDATA[Figure 1.1 (I)]]></name>
    <row number="0">
      <col number="0"><![CDATA[United Kingdom]]></col>
    </row>
    <row number="1">
      <col number="0"><![CDATA[Part I. ]]></col>
      <col number="1"><![CDATA[These data apply to France, Germany and
a couple of other countries.]]></col>
     ...
    </row>
   ...
  </sheet>
</workbook>

extract.xsl

<xsl:for-each select="document($country-file)/countries/country/en">
  <xsl:variable name="current-node" select="."/>
  <xsl:if test="$data-doc//col[matches(., $current-node/text())]">
    <country><xsl:value-of select="$current-node/../@code"/></country>
  </xsl:if>
</xsl:for-each>


In order to speed up the process I was thinking about indexing all
data cells using xsl:key. However, I cannot see how the key() and the
matches() function can be combined to use the former's speed with the
latter's regex power.

I was hoping to do something along these lines, but would need some
help as this doesn't currently work:

<!-- create an index of the cells' contents -->
<xsl:key name="cell" match="col" use="text()"/>

<xsl:for-each select="document($country-file)/countries/country/en">
  <!-- don't lose the current node -->
  <xsl:variable name="current-node" select="."/>
  <!-- change context to the data document -->
  <xsl:for-each select="document($data-file)">
    <!-- key() returns a node-set, so count the nodes in it.
         This doesn't work if the country name is not the only content. -->
    <xsl:if test="count(key('cell', $current-node)) > 0">
      <country><xsl:value-of select="$current-node/../@code"/></country>
    </xsl:if>
  </xsl:for-each>
</xsl:for-each>
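
One possible alternative, given only as an untested sketch: instead of
indexing the cells, build a single regular expression from all country
names and scan each cell once with xsl:analyze-string, then use key()
only for the reverse lookup from name to code. This assumes XSLT 2.0
(already implied by matches()), a hypothetical named template "main" as
the entry point, default values for $country-file and $data-file that
are placeholders, and a lookup file that is not empty. Note that the
result can differ from the loop above: with the longest names tried
first, a cell containing "Nigeria" would report Nigeria but not Niger.

extract-single-pass.xsl (hypothetical file name)

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Placeholder defaults; the real values are whatever is passed in -->
  <xsl:param name="country-file" select="'countries.xml'"/>
  <xsl:param name="data-file" select="'data.xml'"/>

  <xsl:variable name="country-doc" select="document($country-file)"/>

  <!-- Index the countries by English name for the reverse lookup -->
  <xsl:key name="country-by-name" match="country" use="en"/>

  <!-- One alternation of all names: longest first, so that the longer
       of two overlapping names is tried first; regex metacharacters
       in the names are escaped -->
  <xsl:variable name="names-regex" as="xs:string">
    <xsl:variable name="escaped" as="xs:string*">
      <xsl:for-each select="$country-doc/countries/country/en">
        <xsl:sort select="string-length(.)" order="descending"/>
        <xsl:sequence
            select="replace(., '([\\|.?*+(){}\[\]\^\$\-])', '\\$1')"/>
      </xsl:for-each>
    </xsl:variable>
    <xsl:sequence select="string-join($escaped, '|')"/>
  </xsl:variable>

  <xsl:template name="main">
    <!-- Collect every country name found in any cell, one scan per cell -->
    <xsl:variable name="hits" as="xs:string*">
      <xsl:for-each select="document($data-file)//col">
        <xsl:analyze-string select="." regex="{$names-regex}">
          <xsl:matching-substring>
            <xsl:sequence select="."/>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:for-each>
    </xsl:variable>
    <countries>
      <!-- Report each distinct name once, mapped back to its code -->
      <xsl:for-each select="distinct-values($hits)">
        <country>
          <xsl:value-of
              select="key('country-by-name', ., $country-doc)/@code"/>
        </country>
      </xsl:for-each>
    </countries>
  </xsl:template>

</xsl:stylesheet>

The three-argument form of key() is used because the context item
inside the last xsl:for-each is a string rather than a node. Like the
matches() version, this is case-sensitive and does not check word
boundaries.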

Maybe there's another solution that is more elegant and more efficient
than what I've shown above. I'd love to know about it.  Thank you in
advance for your help.

Jakob.
