Hello,
I have a large number of XML data files (converted from Excel), each
containing a table made up of rows and data cells.
For each data file, I want to find out which country names, if any,
occur in the table's data cells, and to record every country name found
in a separate file. The country name
may be the only content of the data cell (<col>United Kingdom</col>),
or it may be surrounded by other text (<col>Data has been provided for
United Kingdom only.</col>). A single cell may also contain more than
one country name. Cells contain no other elements, just character data.
My current approach is to keep an exhaustive lookup file with *all*
country names that might be used. For each XML data file, I loop over
all country names and test whether any data cell in that file matches
the current country name.
The following works but is rather slow:
countries.xml
<countries>
  <country code="ABW">
    <fr>Aruba</fr>
    <en>Aruba</en>
  </country>
  <country code="AFG">
    <fr>Afghanistan</fr>
    <en>Afghanistan</en>
  </country>
  ...
</countries>
data.xml
<workbook>
  <sheet>
    <name><![CDATA[Figure 1.1 (I)]]></name>
    <row number="0">
      <col number="0"><![CDATA[United Kingdom]]></col>
    </row>
    <row number="1">
      <col number="0"><![CDATA[Part I. ]]></col>
      <col number="1"><![CDATA[These data apply to France, Germany and a couple of other countries.]]></col>
      ...
    </row>
    ...
  </sheet>
</workbook>
extract.xsl
<xsl:for-each select="document($country-file)/countries/country/en">
  <xsl:variable name="current-node" select="."/>
  <xsl:if test="$data-doc//col[matches(., $current-node/text())]">
    <country><xsl:value-of select="$current-node/../@code"/></country>
  </xsl:if>
</xsl:for-each>
To speed things up, I was thinking about indexing all data cells using
xsl:key. However, I cannot see how the key() and matches() functions
can be combined to get the former's speed together with the latter's
regex power.
I was hoping to do something along these lines, but I need some help,
as this doesn't currently work:
<!-- create an index of the cells' contents -->
<xsl:key name="cell" match="col" use="text()"/>
<xsl:for-each select="document($country-file)/countries/country/en">
  <!-- don't lose the current node -->
  <xsl:variable name="current-node" select="."/>
  <!-- change context to data document -->
  <xsl:for-each select="document($data-file)">
    <!-- key returns a nodeset, so count the number of nodes in the nodeset.
         this doesn't work if the country name is not the only content -->
    <xsl:if test="count(key('cell', $current-node)) > 0">
      <country><xsl:value-of select="$current-node/../@code"/></country>
    </xsl:if>
  </xsl:for-each>
</xsl:for-each>
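One other direction I have considered, but neither tested nor
benchmarked, is to turn the loop around: join all country names into a
single regular expression, scan each cell once with xsl:analyze-string,
and use the key on the (small) lookup file instead of on the data cells.
The following is only a sketch, assuming XSLT 2.0. The names
country-by-name, escaped-names, country-regex, hit and countries-found
are placeholders I made up, as are the default file names of the two
parameters:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed parameters: locations of the lookup file and of one data file -->
  <xsl:param name="country-file" select="'countries.xml'"/>
  <xsl:param name="data-file" select="'data.xml'"/>

  <xsl:variable name="countries" select="document($country-file)"/>

  <!-- Index the lookup file (not the data cells) by English name -->
  <xsl:key name="country-by-name" match="country" use="en"/>

  <!-- Escape regex metacharacters in each name; longest names first so that
       "Nigeria" is tried before "Niger" -->
  <xsl:variable name="escaped-names" as="item()*">
    <xsl:for-each select="$countries/countries/country/en[normalize-space()]">
      <xsl:sort select="string-length(.)" order="descending"/>
      <xsl:sequence
          select="replace(., '([\.\[\]\\\|\-\^\$\?\*\+\{\}\(\)])', '\\$1')"/>
    </xsl:for-each>
  </xsl:variable>

  <!-- One alternation covering every country name -->
  <xsl:variable name="country-regex"
      select="string-join($escaped-names, '|')"/>

  <xsl:template match="/">
    <!-- Scan each cell once; collect every country name that occurs in it -->
    <xsl:variable name="hits" as="element(hit)*">
      <xsl:for-each select="document($data-file)//col">
        <xsl:analyze-string select="." regex="{$country-regex}">
          <xsl:matching-substring>
            <hit><xsl:value-of select="."/></hit>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:for-each>
    </xsl:variable>

    <!-- Report each distinct name once, looked up by key in the lookup file -->
    <countries-found>
      <xsl:for-each select="distinct-values($hits)">
        <country>
          <xsl:value-of select="key('country-by-name', ., $countries)/@code"/>
        </country>
      </xsl:for-each>
    </countries-found>
  </xsl:template>

</xsl:stylesheet>
```

Note that this would not behave exactly like my matches() version: a
name that is contained in a longer matched name would not be reported
separately, and a name that is broken across two lines inside a cell
would still be missed. I also do not know whether a single large
alternation is actually faster in practice.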
Maybe there's another solution that is more elegant and more efficient
than what I've shown above. I'd love to know about it. Thank you in
advance for your help.
Jakob.
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list