xsl-list

[xsl] Grouping and sorting using custom collation class with Saxon

2010-03-23 16:10:00
I have built a custom collation, and the language I am working in has
a number of multigraphs. Here is a sample of the sort sequence (minus
non-ASCII characters) from the Java collation class.

        ("='-';'=';'*' " + /** -,=,* are used to indicate various types of
affixes and clitics. These should be ignored.*/
        "< a,A " +
        "< '''a,'''A " + /** 'a,'A*/
        "< aa,Aa " +
        "< b,B " +
        "< c,C " +
        "< d,D " +
        "< dz,Dz " +
        "< e,E " +
        "< '''e,'''E " + /** 'e,'E*/
        "< ee,Ee " +
        "< f,F " +
        "< g,G " +
        "< gw,Gw " +
        "< gy,Gy " +
        "< h,H " +
        "< i,I " +
        "< '''i,'''I " + /** 'i,'I*/
        "< ii,Ii " +
        "< k,K " +
        "< k''',K''' " + /** k',K'*/
        "< kw,Kw " +
        "< ky,Ky " +
        "< k'''w,K'''w " +  /** k'w,K'w */
        "< k'''y,K'''y " +  /** k'y,K'y */
        "< l,L " +
        etc.
        "< '''y,'''Y ")

Desired output is something like this:

a,A
**********
-ana
atata

'a,'A
**********
'ap
'atata

etc.

k,K
**********
kaba
kopii
ks=
-ks
ksa

k',K'
*********
k'aba
k'ol

kw,Kw
*********
kwduun
kwtaxs

k'w,K'w
*********
k'was
k'wiss
kwiloolag


The source XML structure for each entry looks like this:

<dictionary>
<entry>
    <lexical-unit>
        <form lang="tsi"><text>kaba=</text></form>
    </lexical-unit>
    <trait name="morph-type" value="proclitic"/>
    <sense>
        <grammatical-info value="prenominal"/>
        <gloss lang="en"><text>small</text></gloss>
    </sense>
</entry>
<!-- more entries ... -->
</dictionary>

Any suggestions as to how to most efficiently group the data according
to the parameters of the custom collation?

Currently, I build a regular expression by hand, listing the longest
multigraphs first. The regex engine takes the first alternative that
matches, so this ordering makes it choose the longest matching
multigraph. I use this regex with xsl:analyze-string to add an
@alphaGroupKey attribute to each entry, as shown below.

 <xsl:attribute name="alphaGroupKey">
   <xsl:analyze-string select="lexical-unit/form[@lang='tsi']/text"
     regex="^[-=]*((aa|Aa)|(a|A)|(kw|Kw)|(ky|Ky)|(k|K)|(ḵ|Ḵ))"
     default-collation="http://saxon.sf.net/collation?class=com.lhtrees.xslt.LangXCollation;">
      <xsl:matching-substring>
        <xsl:analyze-string select="." regex="[^-=\*]+$">
          <xsl:matching-substring>
            <xsl:value-of select="."/>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:matching-substring>
   </xsl:analyze-string>
 </xsl:attribute>

I can then do the grouping of entries using for-each-group with the
attribute alphaGroupKey.
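
As a rough illustration of the longest-first ordering, here is a minimal Java sketch that builds the alternation from a list of group heads instead of writing it by hand. The class name AlphaGroupRegex, the HEADS list (a small lower-case subset of the multigraphs above) and the method names are all hypothetical. It is not the stylesheet's method, only one way the regex string could be generated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AlphaGroupRegex {
    // Hypothetical subset of group heads, lower-case forms only.
    static final List<String> HEADS =
        List.of("a", "'a", "aa", "k", "k'", "kw", "ky", "k'w", "k'y");

    // Builds ^[-=*]*(h1|h2|...) with the longest heads listed first,
    // because alternation picks the first alternative that matches.
    static Pattern buildPattern(List<String> heads) {
        List<String> sorted = new ArrayList<>(heads);
        sorted.sort((x, y) -> Integer.compare(y.length(), x.length()));
        String alternation = sorted.stream()
            .map(Pattern::quote)              // treat each head as a literal
            .collect(Collectors.joining("|"));
        return Pattern.compile("^[-=*]*(" + alternation + ")",
                               Pattern.CASE_INSENSITIVE);
    }

    // Returns the lower-cased group head, or "" when no head matches.
    static String groupKey(Pattern p, String headword) {
        Matcher m = p.matcher(headword);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : "";
    }

    public static void main(String[] args) {
        Pattern p = buildPattern(HEADS);
        for (String w : List.of("-ana", "'ap", "kaba=", "k'aba", "kwduun", "k'was")) {
            System.out.println(w + " -> " + groupKey(p, w));
        }
    }
}
```

The leading affix markers (-, =, *) are skipped by the prefix of the pattern, so only the first letter or multigraph ends up in the key. The generated pattern string could then be passed to the stylesheet as a parameter, which is an assumption about the setup rather than something shown above.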

But I am wondering if there is a pre-existing way to use the custom
collation class to do the grouping so I don't need to build the regex
string. It seems like all of the information that is needed is already
in that class.
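
One possible direction, sketched here under several assumptions, is to let the collator itself decide which group a headword belongs to: sort the group heads with the collator, then pick the last head that does not sort after the headword at primary strength. The sketch assumes the custom class is backed by something like java.text.RuleBasedCollator. The class name CollatorGroups, the HEADS list and the trimmed RULES string are hypothetical, and the relative order of n, s, u and w is a guess, since the original rules are elided after l. Affix markers are left out of the rules for brevity.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;

class CollatorGroups {
    // Trimmed, hypothetical subset of the rules; order of n, s, u, w is assumed.
    static final String RULES =
        "< a,A < b,B < d,D < k,K < k''',K''' < kw,Kw < k'''w,K'''w "
        + "< n,N < s,S < u,U < w,W";

    // Hypothetical list of group heads (lower-case forms only).
    static final List<String> HEADS = List.of("a", "k", "k'", "kw", "k'w");

    static Collator newCollator() {
        try {
            RuleBasedCollator c = new RuleBasedCollator(RULES);
            c.setStrength(Collator.PRIMARY); // ignore case differences
            return c;
        } catch (ParseException e) {
            throw new IllegalStateException("bad collation rules", e);
        }
    }

    // Returns the last head (in collation order) that sorts at or before
    // the word, or "" if the word sorts before every head.
    static String groupHead(Collator c, List<String> heads, String word) {
        List<String> sorted = new ArrayList<>(heads);
        sorted.sort(c);
        String best = "";
        for (String h : sorted) {
            if (c.compare(h, word) <= 0) {
                best = h;
            } else {
                break;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Collator c = newCollator();
        for (String w : List.of("ana", "kaba", "k'aba", "kwduun", "k'was")) {
            System.out.println(w + " -> " + groupHead(c, HEADS, w));
        }
    }
}
```

This still needs a list of group heads, but no hand-ordered regex: the multigraph handling comes from the contractions in the rules. Whether such a method can be called from the stylesheet (for example as an extension function) depends on the Saxon edition and configuration, so treat that part as an open question.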

Larry

