RE: [xsl] How to parse text into words, phrases, clauses, sentences, and paragraphs

2007-06-07 07:05:07
--- Michael Kay <mike(_at_)saxonica(_dot_)com> wrote:
You don't really make it clear where you are having difficulty.
There seem to be four separate problems here:

Mike, thanks for helping me break this down. This is definitely
something I can and want to do myself. I just need the initial hints.

(a) translating your concepts, such as "words" and "sentences", into
precise specifications
(b) translating these specifications into regular expressions

Got these. 
E.g. the specification for "word" could be [^ '-]*


(c) using these regular expressions within a stylesheet, for example
as an argument to the tokenize() function or the xsl:analyze-string
instruction.


This is my first problem: how to apply a template match using the
tokenize() function, and in which order to apply it (from paragraph ->
word or word -> paragraph).
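
For (c), one possible starting point is sketched below in XSLT 2.0. It
works top-down (paragraph first, then sentence), on the reasoning that
each coarser delimiter encloses the finer ones. The <text> wrapper
element, the <doc> root, the my: namespace and the my:chunks helper are
all assumptions made for illustration, not anything given in this
thread. tokenize() discards the delimiter it splits on, so the sentence
step goes through xsl:analyze-string to keep the closing punctuation.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:split"
    exclude-result-prefixes="xs my">

  <xsl:output method="xml" indent="no"/>

  <!-- Hypothetical helper: returns every substring of $text that matches
       $regex, in order. Text that does not match (for example leading
       punctuation) is dropped. -->
  <xsl:function name="my:chunks" as="xs:string*">
    <xsl:param name="text" as="xs:string"/>
    <xsl:param name="regex" as="xs:string"/>
    <xsl:analyze-string select="$text" regex="{$regex}">
      <xsl:matching-substring>
        <xsl:sequence select="."/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:function>

  <!-- Assumed input: the raw text wrapped in a single <text> element. -->
  <xsl:template match="/text">
    <doc>
      <!-- Paragraphs: split on line breaks, skipping blank lines. -->
      <xsl:for-each select="tokenize(., '\n')[normalize-space()]">
        <para id="{position()}">
          <!-- Sentences: a run of text plus its closing . ? or ! -->
          <xsl:for-each
              select="my:chunks(., '[^.?!]+[.?!]*')[normalize-space()]">
            <sent id="{position()}">
              <xsl:value-of select="normalize-space(.)"/>
            </sent>
          </xsl:for-each>
        </para>
      </xsl:for-each>
    </doc>
  </xsl:template>

</xsl:stylesheet>
```

The same pattern could presumably be nested further, for example
'[^:;]+[:;]*' for clauses and '[^,]+,*' for phrases. Because position()
restarts inside each xsl:for-each, the ids here renumber within each
parent.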

(d) doing the output numbering.

I haven't a clue how this would be done, either way.
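
For (d), one approach that might work is to treat numbering as a
separate second pass over markup that already has the para, sent,
clause and phrase elements but no ids. The sketch below relies on
xsl:number; the assumption that the un-numbered markup exists as an
input document is mine.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity rule: copy anything not handled below unchanged. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Add an id to each structural element. -->
  <xsl:template match="para | sent | clause | phrase">
    <xsl:copy>
      <xsl:attribute name="id">
        <!-- level="any" counts elements of the same name continuously
             through the document; level="single" would instead restart
             at 1 inside each parent. -->
        <xsl:number level="any"/>
      </xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Switching level="any" to level="single" should give the variant that
renumbers for each sub-group.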


The fourth problem seems quite unrelated to the others. Of the other
three, I'm reluctant to launch into answering without knowing which of
the three steps you need help with. (Generally I think most people
answering on this list adopt the approach of trying to help you solve
your problem, rather than doing the work for you.)

Given some initial hints, I should be able to do the rest of the work
myself.


Incidentally, regular expressions are an XSLT 2.0 feature so I assume
you're looking for XSLT 2.0 solutions.


That is an issue. Is there any way to do this without regular
expressions?
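
Without regular expressions, a common XSLT 1.0 technique is a recursive
named template built on contains(), substring-before() and
substring-after(). The sketch below splits on a single delimiter (the
comma, for phrases) and numbers as it goes. The template name "split",
its parameters and the <text> input wrapper are hypothetical names
chosen for illustration.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical recursive splitter: emits one <phrase> per
       delimiter-terminated chunk, keeping the delimiter. -->
  <xsl:template name="split">
    <xsl:param name="text"/>
    <xsl:param name="delim" select="','"/>
    <xsl:param name="n" select="1"/>
    <xsl:choose>
      <xsl:when test="contains($text, $delim)">
        <phrase id="{$n}">
          <xsl:value-of
              select="concat(substring-before($text, $delim), $delim)"/>
        </phrase>
        <!-- Recurse on whatever follows the delimiter. -->
        <xsl:call-template name="split">
          <xsl:with-param name="text"
              select="substring-after($text, $delim)"/>
          <xsl:with-param name="delim" select="$delim"/>
          <xsl:with-param name="n" select="$n + 1"/>
        </xsl:call-template>
      </xsl:when>
      <!-- Last chunk: no delimiter left, emit it unless it is blank. -->
      <xsl:when test="normalize-space($text)">
        <phrase id="{$n}">
          <xsl:value-of select="$text"/>
        </phrase>
      </xsl:when>
    </xsl:choose>
  </xsl:template>

  <!-- Assumed input: the raw text wrapped in a single <text> element. -->
  <xsl:template match="/text">
    <clause>
      <xsl:call-template name="split">
        <xsl:with-param name="text" select="string(.)"/>
      </xsl:call-template>
    </clause>
  </xsl:template>

</xsl:stylesheet>
```

Splitting on several alternative delimiters at once (period, question
mark, exclamation mark) is more awkward this way; translate() can map
them all to one character first, but that loses which terminator was
used.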


Michael Kay
http://www.saxonica.com/

-----Original Message-----
From: mark bordelon [mailto:markcbordelon(_at_)yahoo(_dot_)com]
Sent: 06 June 2007 22:52
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: [xsl] How to parse text into words, phrases, clauses,
sentences, and paragraphs

Hey XML gurus,

Still somewhat new to XML/XSL and need some help getting started on
how to use regular expressions and tokens in English text to transform
it into an XML document marked up for:

1. words (delimited by WS, excluding any external punctuation, but
   allowing internal punctuation)
2. phrases (delimited by the comma)
3. clauses (delimited by colon or semicolon)
4. sentences (delimited by the period, question mark, or exclamation
   mark)
5. paragraphs (delimited by a line break)

Also ideal would be to assign sequenced ids to every tag, either in a
running consecutive style from beginning to end, or repeating from 1
for every level of nesting.

In more concrete terms,

To transform this text:

THOU still unravish'd bride of quietness,  Thou
foster-child 
of Silence and slow Time, Sylvan historian, who
canst thus 
express  A flowery tale more sweetly than our
rhyme:
What leaf-fringed legend haunts about thy shape  Of
deities or 
mortals, or of both,  In Tempe or the dales of
Arcady?
 What men or gods are these? What maidens loth?
What mad pursuit? What struggle to escape?
 What pipes and timbrels? What wild ecstasy?

into this XML (using indexing that renumbers for each sub-group):

<para id="1">
 <sent id="1">
  <clause id="1">
   <phrase id="1">THOU still unravish'd bride of quietness,</phrase>
   <phrase id="2">Thou foster-child of Silence and slow Time,</phrase>
   <phrase id="3">Sylvan historian,</phrase>
   <phrase id="4"> who canst thus express A flowery tale more sweetly than our rhyme</phrase>:
  </clause>
  <clause id="2">
   <phrase id="1">What leaf-fringed legend haunts about thy shape Of deities or mortals,</phrase>
   <phrase id="2"> or of both,</phrase>
   <phrase id="3"> In Tempe or the dales of Arcady?</phrase>
  </clause>
 </sent>
 <sent id="2">What men or gods are these?</sent>
 <sent id="3">What maidens loth?</sent>
 <sent id="4">What mad pursuit?</sent>
 <sent id="5">What struggle to escape?</sent>
 <sent id="6">What pipes and timbrels?</sent>
 <sent id="7">What wild ecstasy?</sent>
</para>


or into this XML (using indexing that is continuous per tag):

<para id="1">
 <sent id="1">
  <clause id="1">
   <phrase id="1">THOU still unravish'd bride of quietness,</phrase>
   <phrase id="2">Thou foster-child of Silence and slow Time,</phrase>
   <phrase id="3">Sylvan historian,</phrase>
   <phrase id="4"> who canst thus express A flowery tale more sweetly than our rhyme</phrase>:
  </clause>
  <clause id="2">
   <phrase id="5">What leaf-fringed legend haunts about thy shape Of deities or mortals,</phrase>
   <phrase id="6"> or of both,</phrase>
   <phrase id="7"> In Tempe or the dales of Arcady?</phrase>
  </clause>
 </sent>
 <sent id="2">What men or gods are these?</sent>
 <sent id="3">What maidens loth?</sent>
 <sent id="4">What mad pursuit?</sent>
 <sent id="5">What struggle to escape?</sent>
 <sent id="6">What pipes and timbrels?</sent>
 <sent id="7">What wild ecstasy?</sent>
</para>

Surely this has been done before. I have searched through archives and
have not found anything, probably since I am searching using the wrong
terminology.

Would really appreciate the help as it would give me insight into
using regular expressions and sequencing in XSL.

Thanks in advance

Mark Bordelon



 


--~------------------------------------------------------------------
XSL-List info and archive: 
http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to:
http://lists.mulberrytech.com/xsl-list/
or e-mail:
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--




