
RE: [xsl] segmenting a paragraph

2007-10-02 01:37:26
When you need to apply regex matching to text that crosses node boundaries,
two approaches have been proposed in the past:

(a) create a string in which the node boundaries are represented by some
recognizable textual markup (you could use saxon:serialize()), then apply
the regex processing, then reinstate the node structure (e.g. by using
saxon:parse()).

(b) do a deep copy, while processing each of the text nodes to replace the
significant features (such as end of sentence) by nodes (e.g. an
<end-of-sentence/> element). Then apply positional grouping techniques to
transform this into your target structure.
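
For illustration, approach (b) might be sketched in XSLT 2.0 roughly as
follows. This is a minimal sketch under stated assumptions, not a tested
solution: the regex [.!?]+ is a simplistic stand-in for real sentence
detection (it will misfire on abbreviations such as "e.g."), the mode name
"mark" is made up for this example, the input is assumed not to contain
real <end-of-sentence> elements, and only the <s> level is shown. Child
elements such as <note> are copied unchanged.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="p">
    <!-- Pass 1: build a temporary tree in which each sentence end found
         in a text node is followed by an <end-of-sentence/> marker. -->
    <xsl:variable name="marked">
      <xsl:apply-templates select="node()" mode="mark"/>
    </xsl:variable>
    <!-- Pass 2: positional grouping; each group ends at a marker. -->
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:for-each-group select="$marked/node()"
                          group-ending-with="end-of-sentence">
        <xsl:variable name="content"
                      select="current-group()[not(self::end-of-sentence)]"/>
        <xsl:choose>
          <!-- The group is terminated by a marker: wrap it in <s>. -->
          <xsl:when test="current-group()[self::end-of-sentence]">
            <s><xsl:copy-of select="$content"/></s>
          </xsl:when>
          <!-- Trailing group without a marker: copy it unwrapped,
               dropping whitespace-only text nodes. -->
          <xsl:otherwise>
            <xsl:copy-of
                select="$content[not(self::text()) or normalize-space()]"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- Text nodes directly inside <p>: emit the text unchanged and add a
       marker after each run of sentence-ending punctuation. -->
  <xsl:template match="text()" mode="mark">
    <xsl:analyze-string select="." regex="[.!?]+">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
        <end-of-sentence/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Child elements such as <note>: deep-copy without looking inside. -->
  <xsl:template match="*" mode="mark">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

In this sketch a <note> that follows the last sentence end stays outside
any <s>, while a <note> in mid-sentence would end up inside the enclosing
<s>. Whitespace at the start of a sentence is not trimmed. The <seg> level
could presumably be added in the same way, with a second marker element
and a nested xsl:for-each-group over the content of each sentence.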

Neither is particularly easy, I'm afraid.

Michael Kay
http://www.saxonica.com/

-----Original Message-----
From: Christian Wittern [mailto:cwittern(_at_)gmail(_dot_)com] 
Sent: 02 October 2007 09:05
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: [xsl] segmenting a paragraph

Dear XSL-list readers,

In trying to solve the following problem I am seeking your help:
I want to segment the paragraphs in a text so that each sentence is 
enclosed in an <s> element and, within each sentence, the runs of words 
between punctuation marks are enclosed in <seg> elements.

So far, I have been capturing the content of <p> in a string 
and then using two nested <xsl:analyze-string> blocks with 
regexes, which work nicely and do what I want.  Now I have 
discovered that there are <note> elements with additional 
markup in some paragraphs, which get lost in this process. 
However, I really want to leave these notes alone, as they are.  So:

<p>Some text.  Some more text, with a comma. <note>This 
stuff, how boring</note></p>

should look like:

<p><s><seg>Some text.</seg></s><s><seg>Some more 
text,</seg><seg> with a comma.</seg></s><note>This stuff, how 
boring</note></p>

I wonder how I tell the processor to leave the note stuff alone?

Any help appreciated,

Christian

--
  Christian Wittern
  Institute for Research in Humanities, Kyoto University
  47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--