
RE: [xsl] How to parse text into words, phrases, clauses, sentences, and paragraphs

2007-06-07 01:40:29
You don't really make it clear where you are having difficulty. There seem
to be four separate problems here:

(a) translating your concepts, such as "words" and "sentences" into precise
specifications

(b) translating these specifications into regular expressions

(c) using these regular expressions within a stylesheet, for example as an
argument to the tokenize() function or the xsl:analyze-string instruction.

(d) doing the output numbering.
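
As a rough illustration of steps (b) and (c) only, and not a full solution: the sketch below assumes a hypothetical input document of the form <text>...</text> and assumes that a "sentence" means a run of characters ending in a period, question mark, or exclamation mark. Both the input element name and that definition are assumptions made for the example; the real specification in (a) may well differ (abbreviations, ellipses, and quotation marks all complicate it).

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical input: <text>What men or gods are these? What maidens loth?</text> -->
  <xsl:template match="text">
    <para>
      <!-- Assumed definition of a sentence: one or more non-terminator
           characters followed by a single terminator (. ? or !) -->
      <xsl:analyze-string select="." regex="[^.?!]+[.?!]">
        <xsl:matching-substring>
          <!-- Inside this instruction "." is the matched substring -->
          <sent><xsl:value-of select="normalize-space(.)"/></sent>
        </xsl:matching-substring>
        <!-- No xsl:non-matching-substring is given, so any text that
             does not end in a terminator is simply discarded -->
      </xsl:analyze-string>
    </para>
  </xsl:template>
</xsl:stylesheet>
```

xsl:analyze-string keeps the terminating punctuation inside each match, whereas tokenize() treats the pattern as a separator and throws it away, so the choice between the two depends on whether the delimiters need to appear in the output.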

The fourth problem seems quite unrelated to the others. As for the other
three, I'm reluctant to launch into an answer without knowing which of
them you need help with. (Generally I think most people answering on this
list adopt the approach of trying to help you solve your problem, rather
than doing the work for you.)

Incidentally, regular expressions are an XSLT 2.0 feature so I assume you're
looking for XSLT 2.0 solutions.
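
Under that XSLT 2.0 assumption, the numbering in (d) that restarts at each level of nesting can be sketched with position(). This is a minimal sketch only: the <text> input element is hypothetical, only the sentence and phrase levels are shown, and because tokenize() discards the delimiters the punctuation does not survive into the output.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical input: a single <text> element holding one paragraph -->
  <xsl:template match="text">
    <para id="1">
      <!-- Split on sentence terminators; the predicate drops tokens
           that are empty or contain only whitespace -->
      <xsl:for-each select="tokenize(., '[.?!]')[normalize-space(.)]">
        <!-- position() counts sentences within this paragraph -->
        <sent id="{position()}">
          <xsl:for-each select="tokenize(., ',')[normalize-space(.)]">
            <!-- position() restarts at 1 for each sentence -->
            <phrase id="{position()}">
              <xsl:value-of select="normalize-space(.)"/>
            </phrase>
          </xsl:for-each>
        </sent>
      </xsl:for-each>
    </para>
  </xsl:template>
</xsl:stylesheet>
```

Continuous numbering per element name does not fall out of position() in this way; one option, offered here as a suggestion rather than a tested solution, is a second pass over the generated tree using xsl:number with level="any".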

Michael Kay
http://www.saxonica.com/

-----Original Message-----
From: mark bordelon [mailto:markcbordelon(_at_)yahoo(_dot_)com] 
Sent: 06 June 2007 22:52
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: [xsl] How to parse text into words, phrases, clauses, sentences, and paragraphs

Hey XML gurus,

Still somewhat new to XML/XSL and need some help getting 
started on how to use regular expressions and tokens in 
English text to transform it into an XML document marked up for:

1. words (delimited by WS, excluding any external punctuation,
   but allowing internal punctuation)
2. phrases (delimited by the comma)
3. clauses (delimited by colon or semicolon)
4. sentences (delimited by the period, question mark, or
   exclamation mark)
5. paragraphs (delimited by a line break)

Also ideal would be to assign sequenced id's to every tag, 
either in a running consecutive style from beginning to end, 
or repeating from 1 for every level of nesting. 

In more concrete terms,

To transform this text:

THOU still unravish'd bride of quietness,  Thou foster-child 
of Silence and slow Time, Sylvan historian, who canst thus 
express  A flowery tale more sweetly than our rhyme:
What leaf-fringed legend haunts about thy shape  Of deities or 
mortals, or of both,  In Tempe or the dales of Arcady?
 What men or gods are these? What maidens loth?
What mad pursuit? What struggle to escape?
 What pipes and timbrels? What wild ecstasy?

into this XML: (using indexing that renumbers for each
sub-group)

<para id="1">
 <sent id="1">
  <clause id="1">
   <phrase id="1">THOU still unravish'd bride of quietness,</phrase>
   <phrase id="2">Thou foster-child of Silence and slow Time,</phrase>
   <phrase id="3">Sylvan historian,</phrase>
   <phrase id="4"> who canst thus express A flowery tale more sweetly than our rhyme</phrase>:
  </clause>
  <clause id="2">
   <phrase id="1">What leaf-fringed legend haunts about thy shape Of deities or mortals,</phrase>
   <phrase id="2"> or of both,</phrase>
   <phrase id="3"> In Tempe or the dales of Arcady?</phrase>
  </clause>
 </sent>
 <sent id="2">What men or gods are these?</sent>
 <sent id="3">What maidens loth?</sent>
 <sent id="4">What mad pursuit?</sent>
 <sent id="5">What struggle to escape?</sent>
 <sent id="6">What pipes and timbrels?</sent>
 <sent id="7">What wild ecstasy?</sent>
</para>


or into this XML: (using indexing that is continuous per tag)

<para id="1">
 <sent id="1">
  <clause id="1">
   <phrase id="1">THOU still unravish'd bride of quietness,</phrase>
   <phrase id="2">Thou foster-child of Silence and slow Time,</phrase>
   <phrase id="3">Sylvan historian,</phrase>
   <phrase id="4"> who canst thus express A flowery tale more sweetly than our rhyme</phrase>:
  </clause>
  <clause id="2">
   <phrase id="5">What leaf-fringed legend haunts about thy shape Of deities or mortals,</phrase>
   <phrase id="6"> or of both,</phrase>
   <phrase id="7"> In Tempe or the dales of Arcady?</phrase>
  </clause>
 </sent>
 <sent id="2">What men or gods are these?</sent>
 <sent id="3">What maidens loth?</sent>
 <sent id="4">What mad pursuit?</sent>
 <sent id="5">What struggle to escape?</sent>
 <sent id="6">What pipes and timbrels?</sent>
 <sent id="7">What wild ecstasy?</sent>
</para>

Surely this has been done before. I have searched through 
archives and have not found anything, probably since I am 
searching using the wrong terminology.

Would really appreciate the help as it would give me insight 
into using regular expressions and sequencing in XSL.

Thanks in advance

Mark Bordelon



 

--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


