Re: [xsl] Using XSLT to build an index

At 2011-10-30 14:47 -0700, Mark wrote:

The list archives did not seem to contain an XSLT stylesheet that could index an XML file, but I may have missed it. Is it practical to write my own XSLT 2 indexing stylesheet? If so, I have a bilingual XML file that I want to index.

If you simply want all words, except your stop words, collected to automate the index generation, be aware that I've never been successful with automated indexing myself. For my books I've authored the components of the index, and then pointed to those components from within the code.

My assumptions are that I must get rid of the punctuation properly, then isolate the words, sort them, remove stop words, and so on. To get started, I need a bit of help. All of the phrases are found in two attributes: @czech and @eng.
Three questions:
(1) I am aware from Michael's book that regular expressions may be used in the replace() function, but I do not know how to write that regular expression. I would like to remove all the punctuation from a phrase as follows: for everything except a hyphen [-], replacement should be with an empty string; the hyphen should be replaced with a single space.

Simple character removal can be done with translate() in XSLT 1 or 2 rather than using a regular expression:


    translate($inValue,'-,#.$%',' ')

... where the first argument is your input, the second starts with a "-" followed by any other characters you want removed, and the third is a single space. The hyphen lines up with that space and so becomes a space; the remaining characters have no counterpart in the third argument and so are removed.
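
If you do want the regular expression route in XSLT 2, here is a minimal sketch. It assumes the same $inValue as above and assumes that "punctuation" means the Unicode punctuation (\p{P}) and symbol (\p{S}) categories, which may be broader than what you intend. The hyphen is itself punctuation, so it is turned into a space first:

    replace(translate($inValue,'-',' '),'[\p{P}\p{S}]','')

Note that "$" is classed as a symbol rather than as punctuation, which is why \p{S} is included in the character class.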

(2) I assume that to get rid of extra spaces (if any), I can use a construct like: normalize-space(replace(@czech, 'some regex expression')).


That will reduce all sequences of white-space characters to a single space, and strip leading and trailing white-space.

(3) I assume that tokenize(normalize-space(replace(@czech, 'some regex expression'))) will permit me to write out a list of the words found in those attributes to an XML document. I am not completely clear as to what tokenize() returns, or how to access that return.


tokenize() returns a sequence.  But the input is only a single string.
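
To access the items of that sequence you can iterate over it. A minimal sketch, assuming the context node is the element bearing @czech and using a made-up <word> element for the output:

    <xsl:for-each select="tokenize(normalize-space(translate(@czech,'-,#.$%',' ')),' ')">
      <!-- inside the loop "." is one token, a string -->
      <word><xsl:value-of select="."/></word>
    </xsl:for-each>

The second argument of tokenize() is the separator pattern, here a single space, which is safe because normalize-space() has already collapsed the white-space.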

Actually, if you turn the expression inside-out to get a list of words from the entire document, then something along these lines should work:

distinct-values((//@czech)/tokenize(normalize-space(translate(.,'-,$%.#',' ')),' '))

That gives you a sequence of unique words. Can you work from that in order to do the hyperlinking, or do you need help there as well? Remember you will have to do the same translation when creating your links, so perhaps you should have a user function:


  mark:words($arg)  as  tokenize(normalize-space(translate($arg,'-,$%.#',' ')),' ')

... then use:

  (//@czech)/mark:words(.)

... then when creating your links you'll have the function available to ensure the same tokenizing is done at that point in time.
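
Putting the pieces together, the function and a sorted word list might be sketched as follows. The namespace URI, the <index> and <entry> element names and the stop word list are all placeholders I have made up for illustration, and the sort uses the default collation, which may not order Czech correctly without a suitable collation in your processor:

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:mark="urn:X-mark:index"
        exclude-result-prefixes="xs mark">

      <!-- placeholder stop words; substitute your own list -->
      <xsl:variable name="stopwords" as="xs:string*"
                    select="('a','the','of')"/>

      <!-- one place for the tokenizing rules, shared by index and links -->
      <xsl:function name="mark:words" as="xs:string*">
        <xsl:param name="arg" as="xs:string?"/>
        <xsl:sequence
          select="tokenize(normalize-space(translate($arg,'-,$%.#',' ')),' ')"/>
      </xsl:function>

      <xsl:template match="/">
        <index>
          <!-- unique words from every @czech, minus the stop words -->
          <xsl:for-each select="distinct-values(//@czech/mark:words(.))
                                  [not(lower-case(.) = $stopwords)]">
            <xsl:sort select="lower-case(.)"/>
            <entry><xsl:value-of select="."/></entry>
          </xsl:for-each>
        </index>
      </xsl:template>

    </xsl:stylesheet>

The same mark:words() call can then be used on @eng, and again wherever the links are generated.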


I hope this helps.

. . . . . . . . . . Ken


--
Contact us for world-wide XML consulting and instructor-led training
Crane Softwrights Ltd.            http://www.CraneSoftwrights.com/s/
G. Ken Holman                   mailto:gkholman(_at_)CraneSoftwrights(_dot_)com
Google+ profile: https://plus.google.com/116832879756988317389/about
Legal business disclaimers:    http://www.CraneSoftwrights.com/legal

