xsl-list

Re: [xsl] Dealing mixed content with invalid node-like text

2011-12-04 18:15:50
If you need to read a file in a format that is not XML, then in general I would suggest you start by defining a BNF grammar for the language you want to accept, and then write a parser for that grammar using the usual parsing techniques (top-down or bottom-up) taught in every computer science course. If the language is similar to XML, then it is too complex to parse using regular expressions.
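As a rough illustration of that approach applied to the sample lines quoted below, here is a minimal sketch in Python rather than XSLT. It rests on assumptions that are not in the original question: a "valid" tag is taken to be <name>, </name> or <name/> with an XML-like name and no attributes, tags count as markup only when an opening tag has a matching closing tag, and the input contains no existing entity references. The names TAG, tokenize and to_xml are made up for this example.

```python
import re
from xml.sax.saxutils import escape

# Assumed token grammar: <name>, </name> or <name/>, no attributes.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*(/?)>")

def tokenize(text):
    """Split text into (kind, name, raw) tokens: text, open, close, empty."""
    tokens, pos = [], 0
    for m in TAG.finditer(text):
        closing, name, empty = m.groups()
        if closing and empty:          # "</x/>" is not a tag; leave it as text
            continue
        if m.start() > pos:
            tokens.append(("text", None, text[pos:m.start()]))
        kind = "close" if closing else "empty" if empty else "open"
        tokens.append((kind, name, m.group(0)))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", None, text[pos:]))
    return tokens

def to_xml(text):
    """Keep properly nested tags as markup and escape everything else."""
    tokens = tokenize(text)
    stack, paired = [], set()
    for i, (kind, name, _) in enumerate(tokens):
        if kind == "open":
            stack.append(i)
        elif kind == "close":
            # Look for the nearest unclosed opening tag with the same name.
            for j in range(len(stack) - 1, -1, -1):
                if tokens[stack[j]][1] == name:
                    paired.update((stack[j], i))
                    del stack[j:]      # inner unclosed tags stay unpaired
                    break
    out = []
    for i, (kind, _, raw) in enumerate(tokens):
        if kind == "empty" or i in paired:
            out.append(raw)            # real markup, copied through
        else:
            out.append(escape(raw))    # text or stray tag: escape & < >
    return "".join(out)

line = "Sometext<xx>within valid node</xx>  and like<II .>  Title etc"
print("<nodename>" + to_xml(line) + "</nodename>")
```

The regular expression here only recognises individual tokens; the nesting is handled by the stack, which is the part a single regex over the whole line cannot do.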

Michael Kay
Saxonica

On 04/12/2011 19:15, Karlmarx R wrote:

Hello,

I have a situation wherein I need to convert mixed-content text, which also
contains non-markup text within angle brackets, into XML output. For example,
texts like:

"Sometext<xx>within valid node</xx>  and like<II .>  Title etc"
"Sometext like<1a .>  Title etc,<xx>within<b>something</b>  valid node</xx>  
etc".

Now, the output has to be like:

<nodename>Sometext<xx>within valid node</xx>  and like&lt;II .&gt; Title 
etc</nodename>
<nodename>Sometext like&lt;1a .&gt; Title etc,<xx>within<b>something</b>  valid 
node</xx>  etc</nodename>

At present I do not get things like <br/>, but if I did, since it is valid, I should
treat it as a node. The point I am trying to make is that non-node things like <II .>
and <1a .> need to have their angle brackets escaped to make the XML valid. Currently I
use analyze-string with a regex to deal with this, which does not work correctly (due to
mistakes in the regex). But I would like to know whether there is a good standard
solution for dealing with this sort of text. At present each line of text is passed to
this template and treated like:

<xsl:template name="tag-text">
    <xsl:param name="unparsed" required="yes"/>
    <!-- this regex has flaws, in that it fails to treat those invalid nodes -->
    <xsl:analyze-string select="$unparsed"
            regex="^(.*?)&lt;(.+)&gt;(.*)&lt;/(.+)&gt;(.*?)$">
        <xsl:matching-substring>
            <!-- process, and if necessary recursively call this template again -->
        </xsl:matching-substring>
        <xsl:non-matching-substring>
            <xsl:value-of select="."/>
        </xsl:non-matching-substring>
    </xsl:analyze-string>
</xsl:template>

I suspect there could be a better regex to get the result I want, but I am
not sure whether XSLT itself has a better way to deal with this. Could you
please suggest possible solutions (including a better regex, if any of you
have used one successfully)?

Thanks in advance,
Karl


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail:<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--



