xsl-list

Re: [xsl] Extraction of data using key() and matches()

2010-06-05 17:02:33
On Sat, Jun 5, 2010 at 23:42, Michael Kay <mike(_at_)saxonica(_dot_)com> wrote:
On 05/06/2010 20:02, Jakob Fix wrote:

Hello,

I have a large number of XML data files (previously Excel files), each
of which contains a table made up of rows and data cells.

I want to find out whether a given country name appears in the table's
data cells. If it does, I want to record in another file all country
names that appear in the data file. The country name
may be the only content of the data cell (<col>United Kingdom</col>),
or it may be surrounded by other text (<col>Data has been provided for
United Kingdom only.</col>). It can also be that more than one country
name appears in a table cell. There won't be other elements in the
cell, just character data.

My current approach is to have an exhaustive lookup file with *all*
country names that are potentially used. For each XML data file, I
loop over all country names and check whether the contents of the
data file match the current country name.



You could create an index on all the "words" in the text using

<xsl:key name="words" match="col" use="tokenize(., '\P{L}+')"/>

where a word is defined as a maximal sequence of "letter" characters.
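
To make that definition of a "word" concrete, here is a minimal sketch
(assuming an XSLT 2.0 processor; the sample string is taken from the
question, and the "|" separator is only there to make token boundaries
visible). It splits one cell's text on runs of non-letter characters:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Sample cell content, copied from the question -->
    <xsl:variable name="cell"
        select="'Data has been provided for United Kingdom only.'"/>
    <!-- \P{L}+ matches one or more characters that are NOT letters,
         so every run of spaces, digits or punctuation is a separator -->
    <xsl:value-of select="tokenize($cell, '\P{L}+')" separator="|"/>
  </xsl:template>

</xsl:stylesheet>
```

Because the text ends with a separator (the full stop), tokenize() also
returns a trailing zero-length string, so the key will contain an empty
"word" for such cells. That should be harmless as long as you never
look up an empty string.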

Then to see whether a given country is present you could start by testing
whether the first word of the country name is present:

key('words', tokenize($country, '\P{L}+')[1])

and then apply a more sensitive test to the result of this first filter.
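
Put together, the two stages might look something like the sketch
below. It is only an illustration under several assumptions that are
not in the original posts: an XSLT 2.0 processor, a hypothetical lookup
file countries.xml of the form
<countries><country>United Kingdom</country>...</countries> located
next to the stylesheet, made-up output element names, and a plain
contains() as the second, stricter test.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Index every col cell under each "word" it contains -->
  <xsl:key name="words" match="col" use="tokenize(., '\P{L}+')"/>

  <!-- Hypothetical lookup file; name and structure are assumptions -->
  <xsl:variable name="countries"
      select="doc('countries.xml')/countries/country"/>

  <xsl:template match="/">
    <!-- Remember the data document: inside the for-each the context
         node belongs to the lookup file, so key() needs to be told
         which document to search (third argument) -->
    <xsl:variable name="data" select="."/>
    <found>
      <xsl:for-each select="$countries">
        <xsl:variable name="country" select="normalize-space(.)"/>
        <!-- Stage 1: cells containing the first word of the name -->
        <xsl:variable name="candidates"
            select="key('words', tokenize($country, '\P{L}+')[1], $data)"/>
        <!-- Stage 2: of those, cells containing the full name;
             normalize-space() copes with names wrapped across lines -->
        <xsl:if test="$candidates[contains(normalize-space(.), $country)]">
          <country><xsl:value-of select="$country"/></country>
        </xsl:if>
      </xsl:for-each>
    </found>
  </xsl:template>

</xsl:stylesheet>
```

The key lookup keeps the expensive string test away from most cells,
since only cells that contain the first word as a whole word are
examined. Note that contains() is case-sensitive and is a substring
test, so a multi-word name could still match inside a longer phrase; a
matches() call with explicit non-letter boundaries would be stricter,
provided any regex metacharacters in the country name are escaped.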

Michael Kay
Saxonica


Thanks Michael, I'll give this a try.

Jakob.
