RE: Ignoring HTML - A Solution

2005-06-21 08:01:38
I thought I'd post a solution to my request last week to remove "HTML tags" from
a block of XML.  There may be a better way to do this, but this seems to work in
my case. Thanks for everyone's input.

<xsl:template name="strip-HTML">
    <xsl:param name="text"/>
    <xsl:choose>
        <xsl:when test="contains($text, '&gt;')">
            <xsl:choose>
                <xsl:when test="contains($text, '&lt;')">
                    <xsl:value-of select="substring-before($text, '&lt;')"/>
                </xsl:when>
                <xsl:otherwise>
                    <xsl:value-of select="substring-before($text, '&gt;')"/>
                </xsl:otherwise>
            </xsl:choose>
            <xsl:call-template name="strip-HTML">
                <xsl:with-param name="text" select="substring-after($text, '&gt;')"/>
            </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
            <xsl:value-of select="$text"/>
        </xsl:otherwise>
    </xsl:choose>
</xsl:template>
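
For anyone wiring this up, a minimal sketch of calling the template might look like the following. It assumes the escaped markup is the text content of a <title> element, as in the example from the original question, and that the fragment sits in the same stylesheet as strip-HTML. The match pattern is an assumption, not part of the original post.

```xml
<!-- Assumption: the escaped markup is the text content of <title>. -->
<xsl:template match="title">
    <title>
        <!-- Pass the element's string value. By this point the XML parser
             has already turned &lt; and &gt; into literal < and > characters. -->
        <xsl:call-template name="strip-HTML">
            <xsl:with-param name="text" select="string(.)"/>
        </xsl:call-template>
    </title>
</xsl:template>
```

For the sample title from the question, this should leave "The Text That I Want" (modulo whitespace). One caveat: the template expects each '<' to come before its matching '>'. A stray '>' ahead of the next '<' would cause some text to be emitted twice, because the recursion restarts after the first '>'.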

Jay

| Jay Burgess [Vertical Technology Group]
| "Essential Technology Links via RSS"
| http://www.vtgroup.com/

Subject: Re: [xsl] Ignoring HTML
From: "Sam D. Chuparkoff" <sdc(_at_)xxxxxxxxxx>
Date: Fri, 17 Jun 2005 13:39:59 -0700

On the dangerous side, I'd try something like:

perl -ne '$c.=$_;eof&&($c=~s/&lt;(([^<>](?!&lt;))*?)&gt;//sg&print$c);' foo.xml

Because it will probably be fine. For extra danger points, you can put
it in a Makefile with no comment.

You should be able to do something similar with xsl, but of course this
isn't very safe, and I think it would be a lot more complicated.

s/&lt;(([^<>](?!&lt;))*?)&gt;//sg;

This replaces '&lt;', some text, and '&gt;' with nothing, provided the
text in between contains no '&lt;', '<', or '>'. I thought about
actually trying to turn this content into xml, but note there's no
close quote on that style attribute! Watch out!

sdc

On Fri, 2005-06-17 at 15:13 -0500, Jon Gorman wrote:
On 6/17/05, Jay Burgess <lists(_at_)xxxxxxxxxxx> wrote:
I apologize if this is in the FAQ, but I've searched and can't find it.
(I'm kind of new to XSL, so I may just have not seen it.)

This is a FAQ of sorts, but I had a little bit of a difficult time
finding an answer to it in Dave Pawson's FAQ as well.  Of course, I
just did a quick glance.  I'd recommend skimming the CDATA section
as well.


I've got some XML that contains HTML-formatted text.  For example:

<title>&lt;SPAN style="font-size: 13pt; font-family: Verdana; &gt;The
&lt;b&gt;Text&lt;/b&gt; That I Want&lt;/SPAN&gt;</title>


"HTML-formatted text" is a little bit nonsensical.  HTML itself says
that &lt; is meant as a stand-in for <, so when you have it, it's not a
tag.  Since namespaces were rather slow to get started, we ended
up seeing people put so-called "HTML" in XML *cough* RSS *cough*.  But
to any XML application, this is one big chunk of text.

So, some possible advice:

1) if you can change the input format so that it uses namespaces and
actually embeds real XHTML into the documents you're creating, do so. 
Or at least have it be an option.

2) If you can't do that, I'm sure you can find a more general solution
if you hunt through the archives.  The essential solution will
probably be along the lines of looking for &lt; and &gt;s and throwing
any text between them out via some of the XPath/XSLT string functions.
Might be much easier with XSLT 2.0.

3) It may be possible with a combination of d-o-e
(disable-output-escaping) and doing multiple transformations, regex
scripting, or other techniques to replace the various &lt; and &gt; in
certain elements but not others, then reprocess that document through
your final stylesheet.  Of course, this makes it slightly dangerous.
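
To illustrate point 2 under XSLT 2.0, here is a minimal sketch. It assumes an XSLT 2.0 processor and that the escaped markup is the text content of a <title> element. replace() is the standard XPath 2.0 function; the match pattern and the regular expression are illustrative assumptions, not something taken from the archives.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Assumption: the escaped markup is the text content of <title>. -->
    <xsl:template match="title">
        <title>
            <!-- Delete every run of the form <...> that contains no
                 nested < or >. The pattern is written with &lt; and &gt;
                 only because it sits inside an XML attribute. -->
            <xsl:value-of select="replace(., '&lt;[^&lt;&gt;]*&gt;', '')"/>
        </title>
    </xsl:template>

</xsl:stylesheet>
```

Like any purely textual approach, this also removes legitimate text that happens to sit between a literal '<' and '>', so it is only reasonable when those characters appear nowhere else in the content.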

Dig through the archives; there might be a more general solution
already done, or someone else will be able to give you one instead of
just giving you some ranting.  (I blame Friday afternoon and a slow
server for my current long-winded explanation of why this type of
embedding is evil.)

Short answer, it's probably not difficult as long as it's relatively
straightforward.  If the "html" inside the xml is complex at all or
you are using &lt; in other places, you might have difficulty.

Extremely simple if you can just have the input source use namespaces
and you're comfortable with how XSLT deals with namespaces.
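
As a sketch of the namespace route: suppose the source could emit real XHTML inside the element instead of escaped text. The input shape shown in the comment and the h prefix are assumptions for illustration only. The XHTML namespace URI is the standard one.

```xml
<!-- Assumed input, with real markup instead of escaped text:
     <title><h:span xmlns:h="http://www.w3.org/1999/xhtml"
         style="font-size: 13pt; font-family: Verdana;">The
         <h:b>Text</h:b> That I Want</h:span></title> -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

    <!-- The string value of <title> is just its text content, so the
         markup drops out without any string manipulation. -->
    <xsl:template match="title">
        <title><xsl:value-of select="."/></title>
    </xsl:template>

    <!-- If some formatting should survive, match the namespaced
         elements explicitly, for example bold text: -->
    <xsl:template match="h:b">
        <strong><xsl:apply-templates/></strong>
    </xsl:template>

</xsl:stylesheet>
```

With input like this, the first template alone yields the plain text. The second template only comes into play if the title template is changed to use xsl:apply-templates instead of xsl:value-of.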

Jon Gorman





--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


