xsl-list

[xsl] analyze-string question

2012-10-25 23:38:37
Dear XSLT-list,

For an up-conversion of a plain-text word-list with grammatical classification 
information to XML, I've been given a file with lines like the following:

DRUG<OJ MO <MS-P <P 3V>>

The desired output is:

DRUG<stress>o</stress>J MO <alt>MS-P <alt>P 3V</alt></alt>

That there are angle brackets in the input isn't a problem; I can convert them 
easily enough to &lt; and &gt; (or anything else, for that matter). The two 
problems over which I'm stumbling are:

1. In the source document, angle brackets have two very different meanings:

a) Sometimes an angle bracket ("<" or ">") is a stand-alone (unpaired) 
diacritic that tells me that the following vowel letter is stressed. That's the 
case of the first "<" in the example above, and I want to remap it to <stress> 
tags around the stressed vowel. Both "<" and ">" may have this function; the 
former marks primary stress and the latter (which is rare) secondary stress. I 
need to tag them differently.

b) At other times angle brackets delimit an alternative grammatical 
classification, and in that case I want to remap open and close angle brackets 
to open and close <alt> tags. In the example above, the primary grammatical 
classification is the "MO" and the rest is an alternative. But ...

2. When angle brackets demarcate an alternative grammatical classification, 
they may nest. In the example above, the primary grammatical classification is 
"MO" with an alternative "MS-P <P 3V>". The alternative itself has a nested 
structure, though; within the alternative, "MS-P" is primary and "P 3V" is 
alternative.

For what it's worth, as far as I've been able to tell, there is never a stress 
within an "alt" section (that is, between angle brackets that do not represent 
stress, and that instead delimit an alternative grammatical classifier). It is 
not the case that the stressed word always comes before the grammatical 
information; there may also be stressed words later in the entry. Most entries 
do not have alternative classifiers, but many do.

Until I stumbled on the nested alternative identifiers, I was using 
<xsl:analyze-string> to match "&lt;(.+?)&gt;" and replace it with 
<alt><xsl:value-of select="regex-group(1)"/></alt>. On a subsequent pass, I 
then used <xsl:analyze-string> to match "(&lt;|&gt;)(.)", deploying 
<xsl:choose> to select tags (for primary or secondary stress) based on the 
value of regex-group(1), and then wrap the appropriate tags around 
regex-group(2). This seemed to do what I wanted.
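
A minimal sketch of that two-pass approach, reconstructed from the description above rather than copied from the actual stylesheet. The parameter name "line", its sample value, the "entry" wrapper, the "stress" mode, and the type="secondary" attribute are all assumptions made for illustration:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- Hypothetical input line: one non-nested alternative, then one stress. -->
  <xsl:param name="line" select="'MO &lt;MS-P&gt; DRUG&lt;OJ'"/>

  <xsl:template name="main">
    <!-- Pass 1: wrap each "<...>" pair in <alt>, leaving other text alone. -->
    <xsl:variable name="pass1">
      <xsl:analyze-string select="$line" regex="&lt;(.+?)&gt;">
        <xsl:matching-substring>
          <alt><xsl:value-of select="regex-group(1)"/></alt>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <entry>
      <!-- Pass 2 runs over the nodes that pass 1 produced. -->
      <xsl:apply-templates select="$pass1/node()" mode="stress"/>
    </entry>
  </xsl:template>

  <!-- Keep <alt> elements and descend into their text. -->
  <xsl:template match="alt" mode="stress">
    <xsl:copy>
      <xsl:apply-templates mode="stress"/>
    </xsl:copy>
  </xsl:template>

  <!-- Pass 2: a leftover "<" or ">" marks stress on the next character. -->
  <xsl:template match="text()" mode="stress">
    <xsl:analyze-string select="." regex="(&lt;|&gt;)(.)">
      <xsl:matching-substring>
        <xsl:choose>
          <xsl:when test="regex-group(1) = '&lt;'">
            <stress><xsl:value-of select="regex-group(2)"/></stress>
          </xsl:when>
          <xsl:otherwise>
            <!-- Assumed markup for secondary stress; the real tag may differ. -->
            <stress type="secondary"><xsl:value-of select="regex-group(2)"/></stress>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Run with "main" as the initial named template, this should give an entry element containing: MO <alt>MS-P</alt> DRUG<stress>O</stress>J. The sample line puts the alternative before the stressed word on purpose: a stress "<" earlier in the line would itself become the start of the first pass-1 match.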

The strategy failed with the nested alternative classifiers, though, where

<MS-P <P 3V>>

did just what I asked for, even though it wasn't what I wanted (sigh):

<alt>MS-P <P 3V</alt>>

Note the internal "<" and the trailing ">". On the second pass, the one that 
was supposed to handle stress, it got even worse:

<alt>MS-P <stress>P</stress> 3V</alt>>

My next thought was that I wanted to process the input string the way I'm 
doing, except look for matched pairs of angle brackets (representing an 
alternative classifier) from the inside out, instead of from left to right. I 
suspect I could get that with a regex like "&lt;([^&lt;&gt;]+)&gt;" 
(untested, but the point is to find a "<" and a string of anything but "<" or 
">" up to the first ">"), but I don't see how to use <xsl:analyze-string> for 
that, since if the first pass were to yield:

<MS-P <alt>P 3V</alt>>

which is what I want, I don't know how to find the outer pair. If I do 
<xsl:analyze-string> on the entire preceding value (the output of a first 
pass), won't it atomize it (after all, what it's analyzing is a string), wiping 
out the internal markup? And if I try to apply <xsl:analyze-string> to the 
individual text nodes, the "<" and ">" aren't in the same text node.
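
To make the difficulty concrete, here is a sketch of a single inside-out pass over the nested fragment, with each resulting node wrapped so that the node boundaries are visible. The parameter name "fragment" and the "nodes"/"node" wrappers are assumptions made for illustration, and the regex is the bracket-free-pair pattern described above:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <!-- The nested fragment from the example above. -->
  <xsl:param name="fragment" select="'&lt;MS-P &lt;P 3V&gt;&gt;'"/>

  <xsl:template name="main">
    <!-- One inside-out pass: match only pairs with no bracket inside. -->
    <xsl:variable name="pass1">
      <xsl:analyze-string select="$fragment" regex="&lt;([^&lt;&gt;]+)&gt;">
        <xsl:matching-substring>
          <alt><xsl:value-of select="regex-group(1)"/></alt>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <nodes>
      <!-- Wrap each resulting node so the node boundaries are visible. -->
      <xsl:for-each select="$pass1/node()">
        <node><xsl:copy-of select="."/></node>
      </xsl:for-each>
    </nodes>
  </xsl:template>

</xsl:stylesheet>
```

The pass should yield three nodes: a text node holding "<MS-P " (with its trailing space), the alt element around "P 3V", and a text node holding only ">". The outer brackets therefore sit in different text nodes, and taking the string value of the whole result for a second <xsl:analyze-string> would drop the alt element.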

I realize that this may be simpler than it appears to me, and perhaps even much 
simpler, but at the moment I'm having trouble even conceptualizing the problem 
in a way that suggests a solution. I'd be grateful for a gentle (or even 
not-so-gentle) nudge in the right direction. 

Thanks,

David
djbpitt(_at_)gmail(_dot_)com



--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--

