xsl-list

Re: [xsl] Dealing mixed content with invalid node-like text

2011-12-06 18:22:19
If the text is "almost" XML, perhaps the easiest thing to do would be
to fix it so it really is XML, then use a character map to output it
as-is so your second pass can just parse it normally.  If all you need
to do is escape the angle-brackets in something like "<1a .>", your
"tag-text" template could be as simple as:

<xsl:value-of select="replace($unparsed, '&lt;(\S+\s+\.)&gt;',
'&amp;lt;$1&amp;gt;')"/>

And you would have declarations such as this at the top level:

<xsl:output method="xml" version="1.0" encoding="utf-8"
use-character-maps="xmlout"/>
<xsl:character-map name="xmlout">
  <xsl:output-character character="&lt;" string="&lt;"/>
  <xsl:output-character character="&gt;" string="&gt;"/>
  <xsl:output-character character="&amp;" string="&amp;"/>
</xsl:character-map>
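
Putting those two pieces together, a minimal self-contained sketch might
look like the following. The "main" entry template and the $sample
parameter are hypothetical stand-ins of mine for the real first pass,
which would feed each tokenized field to "tag-text"; only the replace()
call and the character map come from the snippets above.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Serialize through the map so '<', '>' and '&' are written as-is. -->
  <xsl:output method="xml" version="1.0" encoding="utf-8"
              use-character-maps="xmlout"/>

  <xsl:character-map name="xmlout">
    <xsl:output-character character="&lt;" string="&lt;"/>
    <xsl:output-character character="&gt;" string="&gt;"/>
    <xsl:output-character character="&amp;" string="&amp;"/>
  </xsl:character-map>

  <!-- Hypothetical sample line, standing in for one tokenized field. -->
  <xsl:param name="sample" as="xs:string"
      select="'Line two with &lt;1a .&gt; Title etc, &lt;i&gt;within&lt;/i&gt;'"/>

  <!-- Hypothetical named entry point, so no source document is needed. -->
  <xsl:template name="main">
    <value>
      <xsl:call-template name="tag-text">
        <xsl:with-param name="unparsed" select="$sample"/>
      </xsl:call-template>
    </value>
  </xsl:template>

  <!-- Escape only pseudo-tags of the form "<token .>"; real tags pass through. -->
  <xsl:template name="tag-text">
    <xsl:param name="unparsed" as="xs:string" required="yes"/>
    <xsl:value-of
        select="replace($unparsed, '&lt;(\S+\s+\.)&gt;', '&amp;lt;$1&amp;gt;')"/>
  </xsl:template>

</xsl:stylesheet>
```

Run from the "main" template with an XSLT 2.0 processor (with Saxon,
for example, via the -it:main option), the value element should come
out with "&lt;1a .&gt;" escaped and "<i>within</i>" written as real
markup, which is what the second pass needs in order to parse it
normally.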

If you have other content being produced in the first pass, whose
correct output is threatened by this mapping, you may need to do some
additional replacements in your "tag-text" template, substituting
arbitrary characters (such as characters from the Unicode Private Use
area) for less-than, greater-than and ampersand, then adjusting the
character-map to map them back to their original forms.
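
As a rough sketch of that variation: the code points U+E000 to U+E002,
the "pua-out" map name, the "main" template and the sample strings
below are arbitrary choices of mine, and the sketch assumes the input
lines contain no bare ampersands and none of those three Private Use
characters.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="xml" version="1.0" encoding="utf-8"
              use-character-maps="pua-out"/>

  <!-- Only the placeholder characters are written raw; ordinary
       '<', '>' and '&' in other text keep their normal escaping. -->
  <xsl:character-map name="pua-out">
    <xsl:output-character character="&#xE000;" string="&lt;"/>
    <xsl:output-character character="&#xE001;" string="&gt;"/>
    <xsl:output-character character="&#xE002;" string="&amp;"/>
  </xsl:character-map>

  <!-- Hypothetical entry point with made-up sample content. -->
  <xsl:template name="main">
    <values>
      <!-- Other first-pass content: serialized with the usual escaping. -->
      <note>AT&amp;T, 1 &lt; 2</note>
      <value>
        <xsl:call-template name="tag-text">
          <xsl:with-param name="unparsed"
              select="'Line one &lt;b&gt;bold&lt;/b&gt; and &lt;II .&gt; Title'"/>
        </xsl:call-template>
      </value>
    </values>
  </xsl:template>

  <xsl:template name="tag-text">
    <xsl:param name="unparsed" as="xs:string" required="yes"/>
    <!-- Step 1: escape the pseudo-tags, as before. -->
    <xsl:variable name="escaped" as="xs:string"
        select="replace($unparsed, '&lt;(\S+\s+\.)&gt;', '&amp;lt;$1&amp;gt;')"/>
    <!-- Step 2: swap '<', '>' and '&' for the Private Use placeholders,
         which the character map turns back into raw characters. -->
    <xsl:value-of
        select="translate($escaped, '&lt;&gt;&amp;', '&#xE000;&#xE001;&#xE002;')"/>
  </xsl:template>

</xsl:stylesheet>
```

Here the note element keeps its ordinary escaping (AT&amp;T, 1 &lt; 2),
while the text produced by "tag-text" is the only content written
through the placeholders.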

This sort of markup hacking is not a road I'd recommend going down,
but if you have to do it, I can't really see a reason to do it in some
other language, if XSLT is what you're comfortable with.  Michael made
a good point about using a proper parser (which I wouldn't implement
in XSLT, as a first choice, even though it would be possible) if you
can put together a proper grammar for your input, but if a few regex
substitutions can get you safely to clean XML, the above approach may
suffice.

-Brandon :)


On Tue, Dec 6, 2011 at 5:42 PM, Karlmarx R <karlmarxr(_at_)yahoo(_dot_)com> wrote:
Hello David,

Yes, I do process the content in two stages: I preprocess it into one form of
XML and then process that further into my final XML form. But both stages are
done in XSL, in a single file, and the problem I reported is in the
first-stage conversion itself. To make things clearer, here is a rough
skeleton and explanation of my process. I read the entire content of the input
into a variable, $input-text, and then tokenize it to get each line of data
into another variable, as below.

<xsl:variable name="lines" select="tokenize($input-text, '\r?\n')"/>

<!-- Then pass it to another template to process each line of data: -->
<xsl:call-template name="process-lines">
  <xsl:with-param name="lines" select="$lines"/>
</xsl:call-template>

<!-- And here I further process it to select the REQUIRED value: -->
<xsl:template name="process-lines">
  <xsl:param name="lines" as="xs:string*"/>

  <xsl:for-each select="$lines">
    <!-- Split the line on tabs; only the last field is of interest. -->
    <xsl:variable name="line-components" select="tokenize(., '\t')"/>

    <xsl:for-each select="$line-components[position() = last()]">
      <value>
        <xsl:call-template name="tag-text">
          <xsl:with-param name="unparsed" select="."/>
        </xsl:call-template>
      </value>
    </xsl:for-each>
  </xsl:for-each>
</xsl:template>


<!-- AND IT IS HERE, in this "tag-text" template, that I try to achieve what I
explained in my original posting -->
<xsl:template name="tag-text">
  <xsl:param name="unparsed" required="yes"/>
  <xsl:analyze-string select="$unparsed"
      regex="^(.*?)&lt;(.+)&gt;(.*)&lt;/(.+)&gt;(.*?)$">

       etc as posted earlier.

The skeleton input will be like (as I mentioned before):

Line one text <b>within valid node</b> and like <II .> Title etc
Line two with <1a .> Title etc, <i>within</i> <b>something</b> etc
another line can be just normal text
....

And it is vital that I get the data in the form I want here, so that our final
output in stage two is correct. In view of this, I cannot use value-of with
disable-output-escaping (d-o-e) here. As it seems this cannot be achieved in
XSL (which looks likely), I am trying to get my source corrected. But if there
is a solution available, in XSL or with a better regex, I would be happy to
use it. I hope the above answers your question.

Thanks,
Karl
