Re: [xsl] xslt function for generating grammatical paradigms

2008-04-21 09:35:20
Imagine that all word-forms of a word look like this:

a11, b11, a12, b12
a11, b21, a12, b22
. . .
a11, bN1, a12, bN2

where the word a1 has the constant parts a11 and a12, and its N
word-forms are generated using the matrix:

  b11, b12
  b21, b22
  . . .
  bN1, bN2

Let's give this matrix the id "InflRule-0001".

Typically there would be hundreds or thousands of words like a1, that
form all their word-forms using the *rule* InflRule-0001.

We can describe the morphology of a language if we specify all the
different inflection rules. This "morphological database" will
typically be a separate xml file that will be indexed by ruleNo.
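
For illustration only, such a rule file might look like the sketch
below. The element and attribute names (inflRules, rule, form, b1, b2)
are assumptions made for this example, not the format of any existing
database; b1 stands for the part inserted between the two constant
parts and b2 for the part appended at the end.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical layout: one <rule> per inflection type,
     one <form> per row of the matrix, in paradigm order. -->
<inflRules>
  <rule id="InflRule-0001">
    <form b1="b11" b2="b12"/>
    <form b1="b21" b2="b22"/>
    <!-- ... further rows, down to row N ... -->
    <form b1="bN1" b2="bN2"/>
  </rule>
</inflRules>
```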


Then, we can have a dictionary with a format like this:

stem11, stem12, RuleNoX
stem21, stem22, RuleNoY
.....................................
stemK1, stemK2, RuleNoK


In this dictionary only the constant parts of each stem are kept,
together with the number of the rule that generates all word-forms
for this stem.

The function:

   generateWordforms(stemPart1, stemPart2, ruleNo)

is trivial to implement.
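
A minimal XSLT 2.0 sketch of such a function follows. The my: namespace,
the inlined rule table with its b1/b2 attributes, and the sample stem
and endings are all assumptions made for illustration; in practice the
rules would be read from the separate file with doc().

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/morph"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Hypothetical rule table with invented endings, inlined so that
       the sketch is self-contained. -->
  <xsl:variable name="rules">
    <rule id="InflRule-0001">
      <form b1="" b2="a"/>
      <form b1="" b2="y"/>
      <form b1="" b2="e"/>
    </rule>
  </xsl:variable>

  <!-- Index the rules by their id (the ruleNo). -->
  <xsl:key name="ruleById" match="rule" use="@id"/>

  <!-- Form i is stemPart1 + b(i,1) + stemPart2 + b(i,2). -->
  <xsl:function name="my:generateWordforms" as="xs:string*">
    <xsl:param name="stemPart1" as="xs:string"/>
    <xsl:param name="stemPart2" as="xs:string"/>
    <xsl:param name="ruleNo" as="xs:string"/>
    <xsl:sequence select="
      for $f in key('ruleById', $ruleNo, $rules)/form
      return concat($stemPart1, $f/@b1, $stemPart2, $f/@b2)"/>
  </xsl:function>

  <!-- Entry point: run with an initial template named "main". -->
  <xsl:template name="main">
    <xsl:value-of
        select="my:generateWordforms('Ze', 'n', 'InflRule-0001')"
        separator=" "/>
  </xsl:template>
</xsl:stylesheet>
```

Run with an initial template (for example Saxon's -it:main option),
this should print "Zena Zeny Zene". The three-argument key() is used
because there is no context node inside a stylesheet function.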

The inflection rules themselves can be autogenerated if the sets of
all word-forms of two or more words having the same inflection type
are provided:

  generateInflRule( (Word1Forms), (Word2Forms) )


So, the only manual work to be done is finding two words that are of
the same inflection type and feeding their wordforms to the generator
above.
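
The algorithm behind generateInflRule is not spelled out here. One
simple reading, sketched below in XSLT 2.0, covers only purely suffixing
paradigms (a12 and every bN1 empty): take the longest common prefix of
each word's forms as its stem, treat the remainders as endings, and
accept the rule only if both words yield the same endings. The my:
namespace, the helper functions and the sample forms are invented for
illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/morph"
    exclude-result-prefixes="xs my">

  <xsl:output method="xml" indent="yes"/>

  <!-- Longest prefix shared by all the given strings. -->
  <xsl:function name="my:commonPrefix" as="xs:string">
    <xsl:param name="words" as="xs:string+"/>
    <xsl:variable name="first" select="$words[1]"/>
    <!-- Every length at which all words still share the prefix;
         length 0 always qualifies. -->
    <xsl:variable name="lens" as="xs:integer+" select="
      for $len in 0 to string-length($first)
      return $len[every $w in $words
                  satisfies starts-with($w, substring($first, 1, $len))]"/>
    <xsl:sequence select="substring($first, 1, max($lens))"/>
  </xsl:function>

  <!-- What is left of each form after removing the common prefix. -->
  <xsl:function name="my:endings" as="xs:string*">
    <xsl:param name="forms" as="xs:string+"/>
    <xsl:variable name="stem" select="my:commonPrefix($forms)"/>
    <xsl:sequence select="
      for $f in $forms return substring($f, string-length($stem) + 1)"/>
  </xsl:function>

  <!-- Returns a <rule> if both words yield the same endings,
       otherwise the empty sequence. Suffix-only simplification. -->
  <xsl:function name="my:generateInflRule" as="element(rule)?">
    <xsl:param name="word1Forms" as="xs:string+"/>
    <xsl:param name="word2Forms" as="xs:string+"/>
    <xsl:variable name="e1" select="my:endings($word1Forms)"/>
    <xsl:variable name="e2" select="my:endings($word2Forms)"/>
    <xsl:if test="deep-equal($e1, $e2)">
      <rule>
        <xsl:for-each select="$e1">
          <form b1="" b2="{.}"/>
        </xsl:for-each>
      </rule>
    </xsl:if>
  </xsl:function>

  <!-- Entry point with invented sample forms. -->
  <xsl:template name="main">
    <xsl:sequence select="
      my:generateInflRule(('Zena', 'Zeny', 'Zene'),
                          ('lipa', 'lipy', 'lipe'))"/>
  </xsl:template>
</xsl:stylesheet>
```

For the sample forms both words give the endings a, y, e, so a rule
with three form elements is returned. Handling a non-empty second
constant part would need a more careful alignment of the forms than
this sketch attempts.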

Then the production of the dictionary can be facilitated by a
function that takes the main wordform of a word *a* together with
its ruleNo and generates the dictionary entry a1, a2, ruleNo.

A recognizer is also straightforward to implement: it takes an
arbitrary wordform and returns a list of successful lookups, each
consisting of the main form of a word and the sequence number of the
given wordform (which provides all the morphological information,
such as part of speech, number, gender, definiteness, case, person,
tense, etc.).
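
A brute-force version of such a recognizer can be sketched in XSLT 2.0
as follows. It regenerates every form of every dictionary entry and
compares it with the input, so it shows the idea rather than an
efficient lookup. The inlined rule table and dictionary, the my:
namespace, the hit element, and the choice of the first form as the
main form are assumptions made for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/morph"
    exclude-result-prefixes="xs my">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical rule table with invented endings. -->
  <xsl:variable name="rules">
    <rule id="InflRule-0001">
      <form b1="" b2="a"/>
      <form b1="" b2="y"/>
      <form b1="" b2="e"/>
    </rule>
  </xsl:variable>

  <!-- Hypothetical dictionary: constant parts plus rule number. -->
  <xsl:variable name="dictionary">
    <entry a1="Ze" a2="n" rule="InflRule-0001"/>
    <entry a1="li" a2="p" rule="InflRule-0001"/>
  </xsl:variable>

  <xsl:key name="ruleById" match="rule" use="@id"/>

  <!-- One <hit> per (entry, form number) whose generated form
       equals the given wordform. -->
  <xsl:function name="my:recognize" as="element(hit)*">
    <xsl:param name="wordform" as="xs:string"/>
    <xsl:for-each select="$dictionary/entry">
      <xsl:variable name="e" select="."/>
      <xsl:variable name="forms"
          select="key('ruleById', $e/@rule, $rules)/form"/>
      <xsl:for-each select="$forms">
        <xsl:if test="concat($e/@a1, @b1, $e/@a2, @b2) eq $wordform">
          <!-- Assumption: form 1 of a rule is the main form. -->
          <hit main="{concat($e/@a1, $forms[1]/@b1,
                             $e/@a2, $forms[1]/@b2)}"
               formNo="{position()}"/>
        </xsl:if>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:function>

  <!-- Entry point: look up one invented wordform. -->
  <xsl:template name="main">
    <hits>
      <xsl:sequence select="my:recognize('lipy')"/>
    </hits>
  </xsl:template>
</xsl:stylesheet>
```

For the input 'lipy' this should return a single hit with main="lipa"
and formNo="2". A real analyzer would index the generated forms instead
of regenerating them for every lookup.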

Long ago I successfully carried out similar work, and it was possible
to build morphological analyzers for Bulgarian (with moderate
difficulty, ~200 rules) and Russian (much more difficult, ~500
rules).



-- 
Cheers,
Dimitre Novatchev




On Sun, Apr 20, 2008 at 7:10 PM, David J Birnbaum 
<djbpitt+xml(_at_)pitt(_dot_)edu> wrote:
Dear XSLT List,

I'm looking into developing an XSLT 2.0 stylesheet that will take a 
linguistic stem of the form XYZ- (where X, Y, and Z are the letters in the 
stem of a lexeme) and generate the full range of endings that occur on that 
word in the relevant grammatical paradigm. Writing up a set of <stem> 
elements and a set of <ending> elements and pasting together all possible 
combinations is easy enough; the problem is sandhi rules, which may cause 
both the stem-final consonant (Z in the preceding example) and the 
grammatical ending to change shape in certain circumstances. As a 
semi-hypothetical example:

1. Given stems "Zen-" and "duS-"

2. Given basic ending "y"

3. "Zen-" plus basic "y" yields "Zeny" (no changes).

4. "duS-" plus basic "y" yields "duSE" (basic "y" is replaced by "E") because 
it's a property of stem-final "S-" that it causes following grammatical 
endings that normally begin with "y" to change their first letters to "E". 
Sequences of "Sy" are fine elsewhere in words; this rule applies only at the 
juncture of stem and grammatical ending.

A brute-force solution is easy enough; just string together replace() 
functions like:

<xsl:variable name="temp06" select="replace($temp05, 'S-y', 'SE')"/>

(where the first rule creates $temp01, feeds it to the rule that creates 
$temp02, etc., and the function ultimately returns the output of the final 
replace() operation).

This type of brute-force approach would string together dozens (possibly 
hundreds) of these rules to account for all possible sandhi modifications. 
That seems inappropriately crude because the rules actually apply to 
*classes* of letters, so that, for example, basic "y" endings are replaced by 
"E" not just after "S", but after half a dozen different consonants, as well 
as after one or two consonant clusters (that is, the last stem consonant 
isn't the trigger for the change in those cases, it's the combination of the 
last two).

What I'm groping for, then, is an elegant rule-based function that lets me 
write a small number of rules by defining classes of letters to which they 
apply, something like "after 'S', 'Z', 'C', 'St', and 'Zd', 'y' is replaced 
by 'E'." As I mention above, these rules apply only at the boundary of stem 
plus ending; "S" can be followed by "y" elsewhere in a word. Since I've 
encoded my stems with trailing hyphens, I can easily distinguish "Sy" (which 
should be left alone) from "S-y" (which should be replaced by "SE").
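
One way such a class-based rule could be written in XSLT 2.0 is sketched 
below: each rule is a regular expression with a replacement, kept as data 
and applied in order by a small recursive function. The my: namespace, the 
rule element with its match and replace attributes, and the function names 
are assumptions made for illustration; the sketch also assumes that the 
only hyphen in the string is the stem boundary.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/morph"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Rules as data, applied top to bottom. -->
  <xsl:variable name="sandhiRules">
    <!-- After S, Z, C, St or Zd an ending-initial y becomes E. -->
    <rule match="(S|Z|C|St|Zd)-y" replace="$1E"/>
    <!-- Clean-up: drop the boundary hyphen where no rule fired. -->
    <rule match="-" replace=""/>
  </xsl:variable>

  <!-- Apply the first rule, then recurse on the remaining rules. -->
  <xsl:function name="my:applyRules" as="xs:string">
    <xsl:param name="input" as="xs:string"/>
    <xsl:param name="rules" as="element(rule)*"/>
    <xsl:sequence select="
      if (empty($rules)) then $input
      else my:applyRules(
             replace($input, $rules[1]/@match, $rules[1]/@replace),
             subsequence($rules, 2))"/>
  </xsl:function>

  <!-- Join a hyphen-terminated stem and an ending, then apply sandhi. -->
  <xsl:function name="my:join" as="xs:string">
    <xsl:param name="stem" as="xs:string"/>
    <xsl:param name="ending" as="xs:string"/>
    <xsl:sequence
        select="my:applyRules(concat($stem, $ending), $sandhiRules/rule)"/>
  </xsl:function>

  <!-- Entry point: the two stems from the example above. -->
  <xsl:template name="main">
    <xsl:value-of select="my:join('Zen-', 'y'), my:join('duS-', 'y')"
                  separator=" "/>
  </xsl:template>
</xsl:stylesheet>
```

Run with an initial template named main, this should print "Zeny duSE". 
Because the pattern requires the hyphen, an "Sy" inside a word is never 
touched, and adding a further trigger consonant means editing one 
alternation rather than adding another replace() step.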

There is also a type of rule where the stem-final consonant changes but the 
grammatical ending doesn't, along the lines of "when 'E' follows a stem that 
ends in 'k', 'g', or 'x', that stem-final consonant changes into 'C', 'Z', 
and 'S', respectively, and the 'E' doesn't change."
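
A per-letter mapping of this kind fits translate() better than replace(), 
since translate() maps 'k', 'g', 'x' to 'C', 'Z', 'S' position by position. 
The sketch below assumes stems carry a trailing hyphen and have at least one 
letter; the my: namespace, the function name and the sample stems "ruk-" and 
"nog-" are invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/morph"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Join stem and ending; before an ending that starts with E,
       stem-final k, g, x become C, Z, S. The ending is unchanged. -->
  <xsl:function name="my:joinWithStemChange" as="xs:string">
    <xsl:param name="stem" as="xs:string"/>   <!-- e.g. 'ruk-' -->
    <xsl:param name="ending" as="xs:string"/>
    <!-- Everything before the final consonant. -->
    <xsl:variable name="body"
        select="substring($stem, 1, string-length($stem) - 2)"/>
    <!-- The final consonant, just before the hyphen. -->
    <xsl:variable name="final"
        select="substring($stem, string-length($stem) - 1, 1)"/>
    <xsl:sequence select="
      if (starts-with($ending, 'E'))
      then concat($body, translate($final, 'kgx', 'CZS'), $ending)
      else concat($body, $final, $ending)"/>
  </xsl:function>

  <!-- Entry point with invented stems. -->
  <xsl:template name="main">
    <xsl:value-of select="
        my:joinWithStemChange('ruk-', 'E'),
        my:joinWithStemChange('nog-', 'E'),
        my:joinWithStemChange('ruk-', 'a'),
        my:joinWithStemChange('Zen-', 'E')"
        separator=" "/>
  </xsl:template>
</xsl:stylesheet>
```

This should print "ruCE noZE ruka ZenE": the stem changes only before "E", 
and a stem ending in any other consonant passes through unchanged.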

Finally, there is a slightly less brute-force approach where I would create 
not just one paradigm of basic endings plus rules to change them in certain 
circumstances, but several paradigms that already incorporate the changes, 
and I would look at the last stem consonant or two and select the appropriate 
paradigm. Is such a "selection" approach more appropriate for this type of 
problem than the "modification" approach I've been contemplating?

In any case, I'd be grateful for any pointers to an elegant way of expressing 
this type of rule in XSLT.

Sincerely,

David
djbpitt+xml(_at_)pitt(_dot_)edu


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


