xsl-list

Re: HTML text extraction

2004-07-26 06:50:18
Hope this could help -

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" version="1.0"
  encoding="UTF-8" indent="yes"/>

<xsl:template match="/root">
  <root>
    <!-- one Subject per delimiter paragraph -->
    <xsl:for-each select="p[(. = 'Heading 1') or (. = 'Heading 2')]">
      <Subject>
        <xsl:value-of select="."/>
        <xsl:text>&#xA;</xsl:text>
        <xsl:variable name="p-id" select="generate-id()"/>
        <content>
          <!-- keep only the non-heading p elements whose nearest
               preceding heading is the current one -->
          <xsl:for-each
            select="following-sibling::p
                      [generate-id(preceding-sibling::p[starts-with(., 'Heading')][1]) = $p-id]
                      [not(starts-with(., 'Heading'))]">
            <xsl:value-of select="."/>
            <xsl:text>&#xA;</xsl:text>
          </xsl:for-each>
        </content>
      </Subject>
    </xsl:for-each>
  </root>
</xsl:template>
        
</xsl:stylesheet>
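
The same grouping can also be expressed with xsl:key, which may scale better on long pages because each paragraph is indexed once instead of being re-tested for every heading. This is only a sketch under assumptions: the key name by-heading is made up for illustration, the input is assumed to be wrapped in a root element as above, and every p whose text starts with 'Heading' is treated as a delimiter (so Heading 3 gets its own Subject too).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Hypothetical key: index every non-heading p by the generated id
       of the nearest preceding heading p -->
  <xsl:key name="by-heading"
           match="p[not(starts-with(normalize-space(.), 'Heading'))]"
           use="generate-id(preceding-sibling::p[starts-with(normalize-space(.), 'Heading')][1])"/>

  <!-- Assumption: the p elements are children of a root wrapper element -->
  <xsl:template match="/root">
    <root>
      <xsl:for-each select="p[starts-with(normalize-space(.), 'Heading')]">
        <Subject>
          <xsl:value-of select="normalize-space(.)"/>
          <xsl:text>&#xA;</xsl:text>
          <content>
            <!-- Look up the paragraphs that belong to the current heading -->
            <xsl:for-each select="key('by-heading', generate-id())">
              <xsl:value-of select="."/>
              <xsl:text>&#xA;</xsl:text>
            </xsl:for-each>
          </content>
        </Subject>
      </xsl:for-each>
    </root>
  </xsl:template>

</xsl:stylesheet>
```

Paragraphs that appear before the first heading get an empty key value and are never looked up, so they are dropped from the output.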

Regards,
Mukul

--- Myron Bennet <vbj34(_at_)yahoo(_dot_)com> wrote:
Hello,

I am using XSL to extract text from HTML pages into
XML. I get all the text between predefined delimiter
keywords such as Heading 1 and Heading 2. The problem
I am having is that the template continues matching
past the delimiter keywords (for example, I want to
match between Headings 1 and 2 only, but the template
matches between Headings 1 and 2 plus everything
after Heading 2). Example input/output and the
recursive template I use are shown below. I would
appreciate any input on this. Thanks.


INPUT HTML:

<p>Heading 1</p>
<p>bbb</p>
<p>aaa</p>
<p>Heading 2</p>
<p>aaa</p>
<p>ccc</p>
<p>Heading 3</p>
...


OUTPUT XML:

<Subject>
Heading 1
<content>
bbb
aaa
</content>
</Subject>
<Subject>
Heading 2
<content>
aaa
ccc
</content>
</Subject>
<Subject>
Heading 3
<content>
...
</content>
</Subject>


RECURSIVE TEMPLATE:
<xsl:template match="//p[starts-with(normalize-space(.),'Heading')]">
  <Subject>
    <xsl:value-of select="."/>
    <content>
      <xsl:variable name="next"
        select="following-sibling::*[not(starts-with(normalize-space(.), 'Heading'))]"/>
      <xsl:if test="$next">
        <xsl:apply-templates select="$next" mode="getContent"/>
      </xsl:if>
    </content>
  </Subject>
</xsl:template>

<xsl:template name="getContent">
  <xsl:value-of select="."/>
  <xsl:variable name="next"
    select="following-sibling::*[not(starts-with(normalize-space(.), 'Heading'))]"/>
  <xsl:if test="$next">
    <xsl:apply-templates select="$next" mode="getContent"/>
  </xsl:if>
</xsl:template>
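
For comparison, a recursive walk that does stop at the next delimiter might look like the sketch below. The select in the template above picks every following sibling that is not a heading, including those after later headings, which is why the output runs on. The sketch instead hands over only the immediately following sibling and lets an empty template end the recursion when that sibling is a heading. It also declares the second template with match and mode rather than name, since xsl:apply-templates with a mode does not invoke a named template. The match="/*" wrapper template is an assumption about the input; the mode name getContent is taken from the post.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Assumption: the p elements are children of a single wrapper element -->
  <xsl:template match="/*">
    <root>
      <xsl:apply-templates select="p[starts-with(normalize-space(.), 'Heading')]"/>
    </root>
  </xsl:template>

  <xsl:template match="p[starts-with(normalize-space(.), 'Heading')]">
    <Subject>
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#xA;</xsl:text>
      <content>
        <!-- Hand over only the immediately following sibling -->
        <xsl:apply-templates select="following-sibling::*[1]" mode="getContent"/>
      </content>
    </Subject>
  </xsl:template>

  <!-- Emit this element's text, then move on to the next sibling -->
  <xsl:template match="*" mode="getContent">
    <xsl:value-of select="."/>
    <xsl:text>&#xA;</xsl:text>
    <xsl:apply-templates select="following-sibling::*[1]" mode="getContent"/>
  </xsl:template>

  <!-- A heading ends the walk: output nothing and do not recurse.
       The predicate gives this template a higher default priority
       than the plain match="*" template above. -->
  <xsl:template match="*[starts-with(normalize-space(.), 'Heading')]"
                mode="getContent"/>

</xsl:stylesheet>
```

The recursion depth grows with the number of paragraphs under one heading, so for very long sections a non-recursive selection may be the safer choice.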



                

