xsl-list

HTML text extraction

2004-07-25 22:23:15
Hello,

I am using XSLT to extract text from HTML pages into
XML. I collect all the text between predefined delimiter
keywords such as Heading 1 and Heading 2. The problem
is that the template keeps matching past the next
delimiter keyword. For example, I want only the text
between Heading 1 and Heading 2, but the template
returns that text plus everything else after Heading 2.
Example input, the desired output, and the recursive
template I use are shown below. I would appreciate any
input on this. Thanks.


INPUT HTML:

<p>Heading 1</p>
<p>bbb</p>
<p>aaa</p>
<p>Heading 2</p>
<p>aaa</p>
<p>ccc</p>
<p>Heading 3</p>
...


DESIRED OUTPUT XML:

<Subject>
Heading 1
<content>
bbb
aaa
</content>
</Subject>
<Subject>
Heading 2
<content>
aaa
ccc
</content>
</Subject>
<Subject>
Heading 3
<content>
...
</content>
</Subject>


RECURSIVE TEMPLATE:
<xsl:template match="//p[starts-with(normalize-space(.),'Heading')]">
  <Subject>
    <xsl:value-of select="."/>
    <content>
      <xsl:variable name="next"
        select="following-sibling::*[not(starts-with(normalize-space(.),'Heading'))]"/>
      <xsl:if test="$next">
        <xsl:apply-templates select="$next" mode="getContent"/>
      </xsl:if>
    </content>
  </Subject>
</xsl:template>

<xsl:template name="getContent">
  <xsl:value-of select="."/>
  <xsl:variable name="next"
    select="following-sibling::*[not(starts-with(normalize-space(.),'Heading'))]"/>
  <xsl:if test="$next">
    <xsl:apply-templates select="$next" mode="getContent"/>
  </xsl:if>
</xsl:template>
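
For comparison, here is a minimal sketch of one way the selection could be limited to the paragraphs that belong to the current heading in XSLT 1.0. The idea is to keep only those following siblings whose nearest preceding heading is the current node, compared with generate-id(). It rests on assumptions not stated above: the input is well-formed XML (for example XHTML), all the <p> elements are siblings, and the <Subjects> wrapper element is a hypothetical name added only so the result has a single root.

```xslt
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical wrapper: process only the heading paragraphs,
       so content paragraphs are not copied by the built-in rules. -->
  <xsl:template match="/*">
    <Subjects>
      <xsl:apply-templates
        select="//p[starts-with(normalize-space(.), 'Heading')]"/>
    </Subjects>
  </xsl:template>

  <xsl:template match="p[starts-with(normalize-space(.), 'Heading')]">
    <!-- Identity of the current heading, used to test ownership below. -->
    <xsl:variable name="this" select="generate-id(.)"/>
    <Subject>
      <xsl:value-of select="normalize-space(.)"/>
      <content>
        <!-- Non-heading siblings whose nearest preceding heading
             is this one; siblings after the next heading fail the test. -->
        <xsl:for-each select="following-sibling::p
            [not(starts-with(normalize-space(.), 'Heading'))]
            [generate-id(preceding-sibling::p
                [starts-with(normalize-space(.), 'Heading')][1]) = $this]">
          <xsl:value-of select="normalize-space(.)"/>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
      </content>
    </Subject>
  </xsl:template>
</xsl:stylesheet>
```

Because preceding-sibling is a reverse axis, the [1] predicate inside that step picks the closest heading before each candidate paragraph, which is what stops the selection at Heading 2. No recursion is needed in this variant.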



        
                