Re: [xsl] Using XSLT to build an index

It's probably better to use xsl:for-each-group for this rather than distinct-values(), since it retains more context.


You can then do

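<xsl:for-each-group select="Text" group-by="tokenize(@data, $myTokenizingRegex)">
  <!-- sort the groups (one per distinct word) by the word itself -->
  <xsl:sort select="current-grouping-key()" lang="cs"/>
  <xsl:for-each select="current-group()">
    <!-- the context item is a Text element; the word is the grouping key -->
    <Word title="{@title}" ref="{@ref}"><xsl:value-of select="current-grouping-key()"/></Word>
  </xsl:for-each>
</xsl:for-each-group>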

Note that xsl:for-each-group puts an element in more than one group if the group-by expression returns more than one value in its result.
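
As a rough, complete sketch of the grouping approach above: the stylesheet below assumes the <Text> elements sit under a wrapper element called <Texts>, writes its result under an <Index> element, and defaults $myTokenizingRegex to white space. Those three names and the default are assumptions for illustration, not something given in the thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: words in @data are separated by white space. -->
  <xsl:param name="myTokenizingRegex" select="'\s+'"/>

  <!-- Assumption: the Text elements are children of a Texts wrapper. -->
  <xsl:template match="/Texts">
    <Index>
      <!-- Each Text joins one group per word returned by tokenize(). -->
      <xsl:for-each-group select="Text"
          group-by="tokenize(normalize-space(@data), $myTokenizingRegex)">
        <!-- Order the groups by their word. -->
        <xsl:sort select="current-grouping-key()" lang="cs"/>
        <!-- Emit one Word per Text that contains this word. -->
        <xsl:for-each select="current-group()">
          <Word title="{@title}" ref="{@ref}">
            <xsl:value-of select="current-grouping-key()"/>
          </Word>
        </xsl:for-each>
      </xsl:for-each-group>
    </Index>
  </xsl:template>

</xsl:stylesheet>
```

Run against a <Texts> document holding the single <Text> element quoted below, this should produce three <Word> elements, one each for Zlutice, Hymnal and 1558, all carrying the same title and ref. A word shared by several <Text> elements yields one <Word> per element within that word's group.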


Michael Kay
Saxonica

On 31/10/2011 05:29, Mark wrote:

I have now normalized and isolated every phrase I wish to index into a few thousand structures similar to:
<Text lang="en" data="Zlutice Hymnal 1558" title="Czech Republic Stamp 664" ref="2010-664.htm"/>
and want to break the @data attribute string into individual words, each associated with its title and ref attributes. How do I use "distinct-values(tokenize(@data))" to construct a sequence of <Word> elements from the <Text> element similar to the following? That is, I don't see how to get at the words returned from distinct-values(tokenize(@data)) one at a time to do this.
<Word title="Czech Republic Stamp 664" ref="2010-664.htm">Zlutice</Word>
<Word title="Czech Republic Stamp 664"  ref="2010-664.htm">Hymnal</Word>
<Word title="Czech Republic Stamp 664"  ref="2010-664.htm">1558</Word>


Thanks, Mark






-----Original Message----- From: G. Ken Holman
Sent: Sunday, October 30, 2011 3:07 PM
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] Using XSLT to build an index

At 2011-10-30 14:47 -0700, Mark wrote:
The list archives did not seem to contain an XSLT stylesheet that could index an XML file, but I may have missed it. Is it practical to write my own XSLT 2 indexing stylesheet? If so, I have a bilingual XML file that I want to index.

Where you simply want all words, except your stop
words, collected to automate the index
generation, I've never been successful with
automated indexing myself.  For my books I've
authored the components of the index, and then
pointed to those components from within the code.

My assumptions are that I must get rid of the punctuation properly, then isolate the words, sort them, remove stop words, and so on. To get started, I need a bit of help. All of the phrases are found in two attributes: @czech and @eng.
Three questions:
(1) I am aware from Michael's book that regex expressions may be used in the replace() function, but I do not know how to write that regex expression. I would like to remove all the punctuation from a phrase as follows: for everything except a hyphen [-], replacement should be with an empty string; the hyphen should be replaced with a single space.

Simple character removal can be done with
translate() in XSLT 1 or 2 rather than using a regular expression:

    translate($inValue,'-,#.$%',' ')

... where the first argument is your input, the
second starts with a "-" and then you put
anything else in there as characters to remove,
the third indicates the hyphen becomes a space and the rest are to be removed.
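
As a small worked illustration of the translate() call above, next to a regex variant of the kind the question asked about: the sample string is invented, and the character class [\p{P}\p{S}] (all Unicode punctuation and symbols) is an assumption about what counts as punctuation to remove. Any input document will do, since the template ignores it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Hypothetical sample phrase. -->
    <xsl:variable name="inValue" select="'Zlutice-Hymnal, 1558.'"/>

    <!-- translate(): '-' pairs with ' '; the other listed characters
         have no counterpart in the third argument and are dropped. -->
    <xsl:value-of select="translate($inValue, '-,#.$%', ' ')"/>
    <xsl:text>&#10;</xsl:text>

    <!-- replace(), XSLT 2 only: hyphen to space first, then remove
         whatever punctuation and symbols remain. -->
    <xsl:value-of
        select="replace(replace($inValue, '-', ' '), '[\p{P}\p{S}]', '')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Both lines of output read "Zlutice Hymnal 1558". The translate() form removes only the characters you list; the regex form removes every character in the named Unicode categories, which may be more than you want.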

(2) I assume that to get rid of extra spaces (if any), I can use a construct like: normalize-space(replace(@czech, 'some regex expression')).

That will reduce all sequences of white-space characters to a single space.

(3) I assume that tokenize(normalize-space(replace(@czech, 'some regex expression'))) will permit me to write out a list of the words found in those attributes to an XML document. I am not completely clear as to what tokenize() returns, or how to access that return.

tokenize() returns a sequence.  But the input is only a single string.
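
To make "returns a sequence" concrete, here is a minimal sketch of walking the strings from tokenize() one at a time with xsl:for-each. The literal input string and the <Words>/<Word> result names are only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <Words>
      <!-- tokenize() yields a sequence of three strings here. -->
      <xsl:for-each select="tokenize('Zlutice Hymnal 1558', '\s+')">
        <!-- Inside the loop, "." is the current string, not a node. -->
        <Word position="{position()}">
          <xsl:value-of select="."/>
        </Word>
      </xsl:for-each>
    </Words>
  </xsl:template>

</xsl:stylesheet>
```

This gives <Word position="1">Zlutice</Word>, <Word position="2">Hymnal</Word> and <Word position="3">1558</Word>. Because the context item inside the loop is a string, attributes of the original element are no longer reachable through "."; bind the element to a variable before the loop if you need them.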

Actually, if you want a list of words from the entire
document, turn the expression inside-out; something
along these lines should work:

distinct-values(
  (//@czech)/tokenize(normalize-space(translate(.,'-,$%.#',' ')),' ')  )

That gives you a sequence of unique words.  Can
you work from that in order to do the
hyperlinking, or do you need help there as
well?  Remember you will have to do the same
translation when creating your links, so perhaps
you should have a user function
mark:words($arg) defined as
tokenize(normalize-space(translate($arg,'-,$%.#',' ')),' ')
... then use:

  (//@czech)/mark:words(.)

... then when creating your links you'll have the
function available to ensure the same tokenizing is done at that
point in time.
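
One way the suggested user function might be declared in XSLT 2 is sketched below. The namespace URI bound to the mark prefix and the <Words>/<Word> result names are made up for the example. The sketch applies translate() before normalize-space() so that spaces produced from hyphens are collapsed as well, and it passes tokenize() an explicit separator, since the two-argument form is the one XPath 2.0 defines.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:mark="urn:x-example:mark"
    exclude-result-prefixes="xs mark">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical helper: split a phrase into words, turning hyphens
       into spaces and dropping the other listed punctuation. -->
  <xsl:function name="mark:words" as="xs:string*">
    <xsl:param name="arg" as="xs:string?"/>
    <xsl:sequence
        select="tokenize(normalize-space(translate($arg, '-,$%.#', ' ')), ' ')"/>
  </xsl:function>

  <xsl:template match="/">
    <Words>
      <!-- Unique words from every @czech attribute in the document. -->
      <xsl:for-each select="distinct-values(//@czech/mark:words(.))">
        <xsl:sort select="." lang="cs"/>
        <Word><xsl:value-of select="."/></Word>
      </xsl:for-each>
    </Words>
  </xsl:template>

</xsl:stylesheet>
```

Calling the same mark:words() wherever the links are generated keeps the tokenizing identical in both places.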
I hope this helps.

. . . . . . . . . . Ken


--
Contact us for world-wide XML consulting and instructor-led training
Crane Softwrights Ltd.            http://www.CraneSoftwrights.com/s/
G. Ken Holman                   mailto:gkholman(_at_)CraneSoftwrights(_dot_)com
Google+ profile: https://plus.google.com/116832879756988317389/about
Legal business disclaimers:    http://www.CraneSoftwrights.com/legal


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--



