On 31/10/2011 12:05, Mark wrote:
Hi Ken and Michael.
Since I have already removed punctuation and substituted a space for
the hyphens, I set up my regular expression as '\s+'. I think that is
correct for tokenizing a string of words separated by blanks, as mine are.
Using this input:
<Text lang="cz" data="Jaroslav Hašek 1883 1923" title="Czechoslovak
Stamp 2575" ref="1983-2575.htm"/>
<Text lang="cz" data="UNESCO" title="Czechoslovak Stamp 2575"
ref="1983-2575.htm"/>
I tried Michael's idea with the following code:
<xsl:for-each-group select="Text" group-by="tokenize(@data,'\s+')">
<xsl:for-each select="current-group()">
<xsl:sort select="current-grouping-key()" lang="cz"/>
<Word title="{@title}" ref="{@ref}">
<xsl:value-of select="."/>
</Word>
</xsl:for-each>
</xsl:for-each-group>
And received the warning: "Sort key will have no effect because its
value does not depend on the context item"
Sorry, I was careless. Try this:
<xsl:template match="doc">
<xsl:for-each-group select="Text" group-by="tokenize(@data,'\s+')">
<xsl:sort select="current-grouping-key()" lang="cz"/>
<xsl:for-each select="current-group()">
<Word title="{@title}" ref="{@ref}">
<xsl:value-of select="current-grouping-key()"/>
</Word>
</xsl:for-each>
</xsl:for-each-group>
</xsl:template>
that gives me:
<Word title="Czechoslovak Stamp 2575" ref="1983-2575.htm">1883</Word>
<Word title="Czechoslovak Stamp 2575" ref="1983-2575.htm">1923</Word>
<Word title="Czechoslovak Stamp 2575" ref="1983-2575.htm">Hašek</Word>
<Word title="Czechoslovak Stamp 2575" ref="1983-2575.htm">Jaroslav</Word>
<Word title="Czechoslovak Stamp 2575" ref="1983-2575.htm">UNESCO</Word>
(but I don't understand why the incorrect version gave you only two Word
elements)
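
For anyone wanting to run the corrected template, here is a minimal
complete stylesheet built around it. It is a sketch under assumptions:
it needs an XSLT 2.0 processor, it assumes the Text elements are
children of a root element named doc (the thread does not show the
wrapper), and the Index element is a hypothetical wrapper added only so
that the result is well-formed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumes XSLT 2.0 and a root element named doc that
     directly contains the Text elements. -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="doc">
    <!-- Index is a hypothetical wrapper element, not from the thread. -->
    <Index>
      <!-- One group per distinct token. A Text element joins every
           group whose key is one of its tokens. -->
      <xsl:for-each-group select="Text" group-by="tokenize(@data, '\s+')">
        <!-- The sort is a child of for-each-group, so it orders the
             groups by key. The lang value is copied from the thread. -->
        <xsl:sort select="current-grouping-key()" lang="cz"/>
        <xsl:for-each select="current-group()">
          <Word title="{@title}" ref="{@ref}">
            <xsl:value-of select="current-grouping-key()"/>
          </Word>
        </xsl:for-each>
      </xsl:for-each-group>
    </Index>
  </xsl:template>

</xsl:stylesheet>
```

Two things may be worth checking. If @data can carry leading or
trailing whitespace, tokenize() will return an empty-string token, and
wrapping the attribute in normalize-space() avoids that. Also, the ISO
639-1 code for Czech is cs, and the set of languages a processor
recognises for sorting varies, so lang="cs" may be worth trying if
accented words sort in an unexpected place.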
Michael Kay
Saxonica
And the output:
<Word title="Czechoslovak Stamp 2575" ref="1983-2575.htm"/>
<Word title="Czechoslovak Stamp 2575" ref="1983-2575.htm"/>
I expected this to produce five <Word> elements: 'Jaroslav', 'Hašek',
'1883', '1923', and 'UNESCO', but only two were produced and the
<xsl:value-of> returned nothing. Is my tokenize returning nothing? I
clearly did something wrong, but cannot see what it is. I'll try Ken's
coding next, but would like to know what I did wrong.
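
On the empty values specifically: inside the inner xsl:for-each the
context item is a Text element, and the sample Text elements have
attributes but no text content, so select="." yields an empty string
whatever tokenize() returned. The fragment below is a sketch of the
contrast, meant to sit inside an xsl:for-each-group like the one
above. The element names Empty, Token and Data are hypothetical. It
does not explain why only two elements were produced.

```xml
<xsl:for-each select="current-group()">
  <!-- "." is the Text element itself; with no child text nodes its
       string value is the empty string. -->
  <Empty><xsl:value-of select="."/></Empty>
  <!-- The single token that defines the group being processed. -->
  <Token><xsl:value-of select="current-grouping-key()"/></Token>
  <!-- The whole attribute value, with all of its tokens. -->
  <Data><xsl:value-of select="@data"/></Data>
</xsl:for-each>
```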
As you surmised, no context is needed. I am collecting my <Text>
elements from a source XML file that, when my other stylesheets are
applied, will generate the documents described in the @title and @ref
attributes, i.e., I am indexing data that will in the future be
located in the described documents; the documents themselves do not
yet exist.
Ken:
With respect to the code you gave me yesterday, my understanding is
that
"distinct-values((//@czech)/tokenize(translate(normalize-space(.),'-,$%.#','
')) )" would give me all the unique Czech words in my source document
at once, but since the documents I am indexing do not yet exist,
getting the title and href of the indexed words in this instance would
be problematic. That is why I chose to construct the <Text> elements
from my source document instead. The key idea here is that my index
does not refer to the source document itself, but to documents that
will come into existence later.
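
The difference between the two approaches can be sketched as follows.
This is an illustration only: it applies distinct-values() to the
Text/@data attributes from this message rather than to the @czech
attributes in Ken's expression, and it assumes the '\s+' pattern used
earlier.

```xml
<!-- Iterating over distinct tokens: each item is a bare string. -->
<xsl:for-each select="distinct-values(//Text/tokenize(@data, '\s+'))">
  <!-- "." is now an xs:string, not a node, so there is no @title or
       @ref to read from it; those would need a separate lookup. -->
  <Word><xsl:value-of select="."/></Word>
</xsl:for-each>
```

Grouping the Text elements instead keeps each token paired with the
element it came from, so @title and @ref remain available inside
current-group().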
Mark
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--