xsl-list

Re: [xsl] tokenize() and regex-group ?

2012-07-18 04:54:23
Hi

Just a last word to say that my problem is solved. Thanks for your quick and helpful answers!

Just a few comments:

I used my own igs:tokenize-as-xml function, which doesn't lose the regex separators (see my last mail). I just changed the output of the function to be a single element with children:
<xsl:function name="igs:tokenize-as-xml" as="element(igs:tok)">
instead of a sequence of elements:
<xsl:function name="igs:tokenize-as-xml" as="element()*">

Why? Because it seems that the axes (preceding-sibling::, the << operator, ...) cannot be used "very well" within a sequence of elements; one needs a context (here, a parent element).
I actually got some strange results when using:
<xsl:variable name="textBegin" select="string-join($tokenTextAsXML[ . &lt;&lt; $myFocusElement],'')"/>
(it looks as if a filter were added, selecting only the nodes whose name is the same as that of $myFocusElement). By the way, myFocusElement is defined within the recursion by:
<xsl:variable name="myFocusElement" select="$tokenTextAsXML[last() - $lookBacklevel + 1]" as="element()"/>

When I used <xsl:function name="igs:tokenize-as-xml" as="element(igs:tok)"> and <xsl:variable name="textBegin" select="string-join($tokenTextAsXML/igs:*[ . &lt;&lt; $myFocusElement],'')"/>
everything works fine.
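
A minimal sketch of why the wrapper element helps, under stated assumptions: the `ex` namespace, the element names and the `main` entry template are hypothetical and are not taken from the real code. In XSLT 2.0, elements built in a variable that has an `as` attribute are parentless, so each one is the root of its own tree. They have no siblings, and the relative document order of nodes in different trees is implementation-dependent. Copying them under one parent gives them a common tree.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.org/ns"
    exclude-result-prefixes="xs ex">

  <!-- Hypothetical names throughout; run with "main" as the initial template. -->
  <xsl:template name="main">

    <!-- Because of the "as" attribute, these three elements are parentless:
         three separate trees, not three siblings. -->
    <xsl:variable name="loose" as="element()*">
      <ex:text>my</ex:text>
      <ex:sep><xsl:text> </xsl:text></ex:sep>
      <ex:text>foo</ex:text>
    </xsl:variable>

    <!-- Copying them under a single parent puts them into one tree. -->
    <xsl:variable name="wrapped" as="element(ex:tok)">
      <ex:tok>
        <xsl:copy-of select="$loose"/>
      </ex:tok>
    </xsl:variable>

    <result>
      <!-- Parentless elements have no siblings: the count is 0. -->
      <loose-preceding-siblings>
        <xsl:value-of select="count($loose[3]/preceding-sibling::*)"/>
      </loose-preceding-siblings>
      <!-- Children of the wrapper are real siblings: the count is 2. -->
      <wrapped-preceding-siblings>
        <xsl:value-of select="count($wrapped/ex:*[3]/preceding-sibling::*)"/>
      </wrapped-preceding-siblings>
      <!-- Inside one tree, << follows document order, so this joins
           the two children that come before the third one. -->
      <wrapped-before>
        <xsl:value-of select="string-join($wrapped/ex:*[. &lt;&lt; $wrapped/ex:*[3]], '')"/>
      </wrapped-before>
    </result>
  </xsl:template>
</xsl:stylesheet>
```

With Saxon this can be run without a source document by naming `main` as the initial template.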

Well, I tried to simplify the explanation; I hope it is understandable.
See the real code at the bottom of this mail.

Best Regards,
Matthieu Ricaud.

<xsl:function name="igs:tokenize-as-xml" as="element(igs:tok)">
        <xsl:param name="string" as="xs:string"/>
        <xsl:param name="regex" as="xs:string"/>
        <xsl:variable name="tmp" as="element()*">
            <xsl:analyze-string select="$string" regex="{$regex}">
                <xsl:matching-substring>
                    <igs:sep><xsl:value-of select="."/></igs:sep>
                </xsl:matching-substring>
                <xsl:non-matching-substring>
                   <igs:text><xsl:value-of select="."/></igs:text>
                </xsl:non-matching-substring>
            </xsl:analyze-string>
        </xsl:variable>
        <igs:tok>
            <xsl:for-each-group select="$tmp" group-adjacent="local-name(.)='sep'">
                <xsl:choose>
                    <xsl:when test="current-grouping-key()">
                        <igs:sep><xsl:value-of select="string-join(current-group(),'')"/></igs:sep>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:copy-of select="current-group()"/>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:for-each-group>
        </igs:tok>
    </xsl:function>
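
A small usage sketch for the function above, repeated here only so that the stylesheet is self-contained. The namespace URI bound to `igs`, the sample string, the simplified regex and the `main` entry template are all assumptions made for illustration. It shows two things: adjacent separators (a space followed by an opening parenthesis) end up in a single igs:sep, and joining the children gives back the original string.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:igs="http://example.org/igs"
    exclude-result-prefixes="xs">

  <!-- Same logic as the function in the mail; the igs namespace URI is assumed. -->
  <xsl:function name="igs:tokenize-as-xml" as="element(igs:tok)">
    <xsl:param name="string" as="xs:string"/>
    <xsl:param name="regex" as="xs:string"/>
    <!-- First pass: one element per matching / non-matching substring. -->
    <xsl:variable name="tmp" as="element()*">
      <xsl:analyze-string select="$string" regex="{$regex}">
        <xsl:matching-substring>
          <igs:sep><xsl:value-of select="."/></igs:sep>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <igs:text><xsl:value-of select="."/></igs:text>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <!-- Second pass: merge runs of adjacent separators into one igs:sep. -->
    <igs:tok>
      <xsl:for-each-group select="$tmp" group-adjacent="local-name(.)='sep'">
        <xsl:choose>
          <xsl:when test="current-grouping-key()">
            <igs:sep><xsl:value-of select="string-join(current-group(),'')"/></igs:sep>
          </xsl:when>
          <xsl:otherwise>
            <xsl:copy-of select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </igs:tok>
  </xsl:function>

  <!-- Run with "main" as the initial template. -->
  <xsl:template name="main">
    <xsl:variable name="text" select="'my foo (bar'" as="xs:string"/>
    <xsl:variable name="tok" select="igs:tokenize-as-xml($text, '(\s|\()')"/>
    <demo>
      <!-- Five children: text "my", sep " ", text "foo", sep " (", text "bar". -->
      <xsl:copy-of select="$tok"/>
      <!-- Nothing is lost: the joined children equal the input string. -->
      <round-trip-ok>
        <xsl:value-of select="string-join($tok/igs:*, '') eq $text"/>
      </round-trip-ok>
    </demo>
  </xsl:template>
</xsl:stylesheet>
```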

(myFocusElement is called spliter in the real code)
<xsl:template name="addRefTheme">
        <xsl:param name="text" as="xs:string"/>
        <xsl:param name="lookBacklevel" select="1" as="xs:integer"/>
        <xsl:variable name="tokenTextAsXML" select="igs:tokenize-as-xml($text,'(\s|\(|«\p{Z}|\p{Z}»|[lL]’)')" as="element()*"/>
        <!-- the text is split into 2 parts; we then try to find a matching anchor for the 2nd part -->
        <xsl:variable name="tokenNum" select="count($tokenTextAsXML/igs:*)" as="xs:integer"/>
        <xsl:variable name="spliter" select="$tokenTextAsXML/igs:*[last() - $lookBacklevel + 1]" as="element()"/>
        <xsl:variable name="textBegin" select="string-join($tokenTextAsXML/igs:*[ . &lt;&lt; $spliter],'')"/>
        <xsl:variable name="textEnd" select="string-join($tokenTextAsXML/igs:*[. &gt;&gt; $spliter or . is $spliter],'')"/>
        <xsl:variable name="matchingAncres" select="$ancres[normalize-space($textEnd)!=''][igs:match-ancre(.,$textEnd)]" as="element()*"/>
        <xsl:variable name="error.msg">
[ERROR][STEP7][ref:theme] <xsl:value-of select="count($matchingAncres)"/> ancre(s) trouvee(s) pour [text=<xsl:value-of select="concat($text,$asterix)"/>]<xsl:call-template name="lf"/>
            <xsl:if test="$config/@debug='1'">
[lookBacklevel=<xsl:value-of select="$lookBacklevel"/>]<xsl:call-template name="lf"/> [textBegin=<xsl:value-of select="$textBegin"/>]<xsl:call-template name="lf"/> [textEnd=<xsl:value-of select="$textEnd"/>]<xsl:call-template name="lf"/>
            </xsl:if>
        </xsl:variable>
        <xsl:variable name="ref_theme_override" select="$config/igs:ref_theme_override/igs:string[normalize-space(@value)=concat(normalize-space($textEnd),$asterix)]" as="element()?"/>
        <xsl:choose>
            <xsl:when test="count($ref_theme_override)=1">
                <xsl:copy-of select="$ref_theme_override/node()" copy-namespaces="no"/>
            </xsl:when>
            <xsl:when test="count($matchingAncres)=1">
                <xsl:value-of select="$textBegin"/>
                <ref:theme idrefCorps="{$matchingAncres/@id}"><xsl:value-of select="concat($textEnd,$asterix)"/></ref:theme>
            </xsl:when>
            <xsl:when test="count($matchingAncres) gt 1">
                <xsl:message><xsl:value-of select="$error.msg"/></xsl:message>
                <xsl:value-of select="concat($text,$asterix)"/>
            </xsl:when>
            <xsl:when test="$lookBacklevel lt $tokenNum">
                <xsl:call-template name="addRefTheme">
                    <xsl:with-param name="text" select="$text"/>
                    <xsl:with-param name="lookBacklevel" select="$lookBacklevel + 1"/>
                </xsl:call-template>
            </xsl:when>
            <xsl:otherwise>
                <xsl:message><xsl:value-of select="$error.msg"/></xsl:message>
                <xsl:value-of select="concat($text,$asterix)"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>



On 17/07/2012 16:54, Matthieu Ricaud-Dussarget wrote:
Thank you Michael.

At first I used <xsl:analyze-string>, before I realized I had to check for one *or more* words to match the definition. I thought using tokenize would help with going back recursively from word to word in the string (with the help of position() predicates). But I did not think about the problems with different separator patterns (the igs:match-ancre() function is permissive about this... but the output tagging is not good, e.g. <link idref="#foobar">« foo bar*</link> » had better be « <link idref="#foobar">foo bar*</link> »).

As usual you're right :-) I have to go back to <xsl:analyze-string>.

The problem I suspected was with a regex which matches 1 or 2 or N words, something like $wordRegex$sepRegex{{$lookBacklevel}}

After you emphasized fn:tokenize(), I finally started on another way of doing it, with the help of a function that tokenizes as XML:

<xsl:function name="igs:tokenize-as-xml" as="element()*">
        <xsl:param name="string" as="xs:string"/>
        <xsl:param name="regex" as="xs:string"/>
        <xsl:variable name="tmp" as="element()*">
            <xsl:analyze-string select="$string" regex="{$regex}">
                <xsl:matching-substring>
                    <igs:sep><xsl:value-of select="."/></igs:sep>
                </xsl:matching-substring>
                <xsl:non-matching-substring>
                   <igs:text><xsl:value-of select="."/></igs:text>
                </xsl:non-matching-substring>
            </xsl:analyze-string>
        </xsl:variable>
        <xsl:for-each-group select="$tmp" group-adjacent="local-name(.)='sep'">
            <xsl:choose>
                <xsl:when test="current-grouping-key()">
                    <igs:sep><xsl:value-of select="string-join(current-group(),'')"/></igs:sep>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:copy-of select="current-group()"/>
                </xsl:otherwise>
            </xsl:choose>
        </xsl:for-each-group>
    </xsl:function>

I hope I can tie everything together; I'll let you know when I'm finished.

Regards,
Matthieu.

On 17/07/2012 15:22, Michael Kay wrote:
You need to use xsl:analyze-string. I don't understand the difficulties in using this inside a recursive template. xsl:analyze-string can do everything that tokenize can do; you could implement tokenize as

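<!-- Illustrative only: the fn namespace is reserved, so a real stylesheet
     must declare this function under a prefix of its own. -->
<xsl:function name="fn:tokenize" as="xs:string*">
  <xsl:param name="in" as="xs:string"/>
  <xsl:param name="regex" as="xs:string"/>
  <xsl:analyze-string select="$in" regex="{$regex}">
    <!-- separators are matched and discarded -->
    <xsl:matching-substring/>
    <xsl:non-matching-substring>
       <xsl:sequence select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:function>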

Start by replacing your call to tokenize with a call to that function, then add whatever functionality you need.
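
A minimal sketch of that first step, under stated assumptions: the function is declared under a hypothetical `ex` prefix (the fn namespace is reserved for the built-in functions), and the sample string and the `main` entry template are invented for illustration. On this simple input it returns the same three strings as the built-in tokenize(). The two are not guaranteed to agree on every input: the built-in function, for instance, returns a zero-length string when a separator occurs at the start of the input.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.org/ns"
    exclude-result-prefixes="xs ex">

  <!-- tokenize() rebuilt on xsl:analyze-string, under a hypothetical prefix. -->
  <xsl:function name="ex:tokenize" as="xs:string*">
    <xsl:param name="in" as="xs:string"/>
    <xsl:param name="regex" as="xs:string"/>
    <xsl:analyze-string select="$in" regex="{$regex}">
      <!-- Separators are dropped here; this is the place to keep them
           if they are needed later. -->
      <xsl:matching-substring/>
      <xsl:non-matching-substring>
        <xsl:sequence select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:function>

  <!-- Run with "main" as the initial template. -->
  <xsl:template name="main">
    <xsl:variable name="text" select="'my foo bar'" as="xs:string"/>
    <check>
      <!-- Both calls return ('my', 'foo', 'bar') for this input. -->
      <same-as-builtin>
        <xsl:value-of select="deep-equal(ex:tokenize($text, ' '), tokenize($text, ' '))"/>
      </same-as-builtin>
      <token-count>
        <xsl:value-of select="count(ex:tokenize($text, ' '))"/>
      </token-count>
    </check>
  </xsl:template>
</xsl:stylesheet>
```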

Michael Kay
Saxonica

On 17/07/2012 14:02, Matthieu Ricaud-Dussarget wrote:
Hi all,

I'm tokenizing some text within a recursive template. The goal is to generate links to some "definitions" inside the doc.
Let's say my text is: "my foo bar"
=> the 1st level of recursion searches for "bar" as a defined anchor in the doc
if it is not found, I increase a $lookBacklevel param:
=> the 2nd level of recursion searches for "foo bar"
and so on... until it finds a matching definition, or throws an error if it does not.
=> when a definition is found, the text is output with a link:
<p>... my <link idref="#anchorFooBar">foo bar</link> ...</p>

To do so, I tokenized the text on spaces:
<xsl:variable name="tokenText" select="tokenize($text,' ')" as="xs:string*"/>

and then made 2 strings depending on the recursion param $lookBacklevel:
<xsl:variable name="textBegin" select="string-join($tokenText[position() lt ($tokenNum - $lookBacklevel + 1)],' ')"/>
<xsl:variable name="textEnd" select="string-join($tokenText[position() ge ($tokenNum - $lookBacklevel + 1)],' ')"/>
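
A worked sketch of that split for the sample text "my foo bar", assuming that $tokenNum is the number of tokens (its definition is not shown in the mail) and using a hypothetical `main` entry template and hypothetical output elements. It lists the two strings for every value of $lookBacklevel.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Run with "main" as the initial template. -->
  <xsl:template name="main">
    <xsl:variable name="text" select="'my foo bar'" as="xs:string"/>
    <xsl:variable name="tokenText" select="tokenize($text,' ')" as="xs:string*"/>
    <!-- Assumption: tokenNum is simply the number of tokens. -->
    <xsl:variable name="tokenNum" select="count($tokenText)" as="xs:integer"/>
    <splits>
      <!-- One split per recursion level, from 1 up to the number of tokens. -->
      <xsl:for-each select="1 to $tokenNum">
        <xsl:variable name="lookBacklevel" select="." as="xs:integer"/>
        <split level="{$lookBacklevel}"
               textBegin="{string-join($tokenText[position() lt ($tokenNum - $lookBacklevel + 1)],' ')}"
               textEnd="{string-join($tokenText[position() ge ($tokenNum - $lookBacklevel + 1)],' ')}"/>
      </xsl:for-each>
    </splits>
  </xsl:template>
</xsl:stylesheet>
```

Level 1 gives textBegin "my foo" and textEnd "bar", level 2 gives "my" and "foo bar", and level 3 gives an empty textBegin and the whole text as textEnd. Rejoining with ' ' only restores the original text because the separator is always a single space.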

I then search for a matching definition :
<xsl:variable name="matchingAncres" select="$ancres[normalize-space($textEnd)!=''][igs:match-ancre(.,$textEnd)]" as="element()*"/>
(matching rules are defined in a specific function)

The problem I've got is that the tokenize separator is too specific: it's only a space, and sometimes words are separated by other characters, like:
- non-breaking space "&#160;"
- opening parenthesis "("
- French quotes "«"
- ...

I could use a regex like "[\s(]«" as the 2nd arg of tokenize(), but I would then not be able to reconstruct the string.

So is there a way to get the separator that has been matched by the regex of tokenize(),
just like regex-group() does when using <xsl:analyze-string>?

I think the answer is "no", but maybe I'm missing a trick to achieve this?

I could maybe use <xsl:analyze-string>, but this is not so easy because of the recursive template: the regex will depend on the $lookBacklevel param. I'm not sure I can find the right pattern...

Regards,

Matthieu.




--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--





--
Matthieu Ricaud
05 45 37 08 90
IGS-CP, service livres numériques

