If the text is "almost" XML, perhaps the easiest thing to do would be
to fix it so it really is XML, then use a character map to output it
as-is so your second pass can just parse it normally. If all you need
to do is escape the angle-brackets in something like "<1a .>", your
"tag-text" template could be as simple as:
<xsl:value-of select="replace($unparsed, '&lt;(\S+\s+\.)&gt;',
'&amp;lt;$1&amp;gt;')"/>
And you would have declarations such as this at the top level:
<xsl:output method="xml" version="1.0" encoding="utf-8"
use-character-maps="xmlout"/>
<xsl:character-map name="xmlout">
<xsl:output-character character="&lt;" string="&lt;"/>
<xsl:output-character character="&gt;" string="&gt;"/>
<xsl:output-character character="&amp;" string="&amp;"/>
</xsl:character-map>
If you have other content being produced in the first pass, whose
correct output is threatened by this mapping, you may need to do some
additional replacements in your "tag-text" template, substituting
arbitrary characters (such as characters from the Unicode Private Use
area) for less-than, greater-than and ampersand, then adjusting the
character-map to map them back to their original forms.
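As a rough sketch of that variant (not tested against your real data): the
stylesheet below assumes XSLT 2.0, and the three Private Use code points
U+E000, U+E001 and U+E002, the "sample" parameter and the "main" template
name are arbitrary choices of mine for illustration. Only the stand-in
characters are mapped, so any ordinary <, > or & produced elsewhere in the
first pass is still escaped by the serializer as usual.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="xml" version="1.0" encoding="utf-8"
      use-character-maps="xmlout"/>

  <!-- Map only the private-use stand-ins back to raw markup characters.
       The choice of U+E000..U+E002 is an assumption; any characters that
       cannot occur in the real input would do. -->
  <xsl:character-map name="xmlout">
    <xsl:output-character character="&#xE000;" string="&lt;"/>
    <xsl:output-character character="&#xE001;" string="&gt;"/>
    <xsl:output-character character="&#xE002;" string="&amp;"/>
  </xsl:character-map>

  <!-- Hypothetical sample line, copied from the skeleton input in this thread -->
  <xsl:param name="sample" as="xs:string"
      select="'Line two with &lt;1a .&gt; Title etc, &lt;i&gt;within&lt;/i&gt; &lt;b&gt;something&lt;/b&gt; etc'"/>

  <!-- Entry point: can be run against any input document,
       or started at the named template "main" -->
  <xsl:template match="/" name="main">
    <value>
      <xsl:call-template name="tag-text">
        <xsl:with-param name="unparsed" select="$sample"/>
      </xsl:call-template>
    </value>
  </xsl:template>

  <xsl:template name="tag-text">
    <xsl:param name="unparsed" as="xs:string" required="yes"/>
    <!-- Step 1: rewrite pseudo-tags such as "<1a .>" as the literal text
         "&lt;1a .&gt;".
         Step 2: swap every remaining <, > and & for its private-use
         stand-in, which the character map turns back into raw markup. -->
    <xsl:value-of select="translate(
        replace($unparsed, '&lt;(\S+\s+\.)&gt;', '&amp;lt;$1&amp;gt;'),
        '&lt;&gt;&amp;',
        '&#xE000;&#xE001;&#xE002;')"/>
  </xsl:template>

</xsl:stylesheet>
```

The intent is that <i> and <b> reach the second pass as real markup, while
"<1a .>" arrives as escaped text. Note that any ampersand already present in
the unparsed text is also passed through raw, just as with the simpler map
above, so the input still has to be otherwise well-formed.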
This sort of markup hacking is not a road I'd recommend going down,
but if you have to do it, I can't really see a reason to do it in some
other language, if XSLT is what you're comfortable with. Michael made
a good point about using a proper parser (which I wouldn't implement
in XSLT, as a first choice, even though it would be possible) if you
can put together a proper grammar for your input, but if a few regex
substitutions can get you safely to clean XML, the above approach may
suffice.
-Brandon :)
On Tue, Dec 6, 2011 at 5:42 PM, Karlmarx R <karlmarxr(_at_)yahoo(_dot_)com>
wrote:
Hello David,
Yes, I do process the content in two stages: I preprocess it into one form of
XML and then process that further into my final XML form. BUT both stages are
done in XSL within one single file, and the problem I reported is in the
first-stage conversion itself. To make things clearer, here is a rough skeleton
and explanation of my process. I read the entire content of the input into a
variable, $input-text, and then tokenize it to get each line of data into
another variable, as below.
<xsl:variable name="lines" select="tokenize($input-text, '\r?\n')"/>
<!--then pass it to another template to process each line of data:-->
<xsl:call-template name="process-lines">
<xsl:with-param name="lines" select="$lines"/>
</xsl:call-template>
<!-- And here, I further process it to select the REQUIRED value, -->
<xsl:template name="process-lines">
  <xsl:param name="lines" as="xs:string*"/>
  <xsl:for-each select="$lines">
    <xsl:variable name="line-components" select="tokenize(., '\t')"/>
    <xsl:for-each select="$line-components[position() = last()]">
      <value>
        <xsl:call-template name="tag-text">
          <xsl:with-param name="unparsed" select="."/>
        </xsl:call-template>
      </value>
    </xsl:for-each>
  </xsl:for-each>
</xsl:template>
<!-- AND IT IS HERE, in this "tag-text" template, that I try to achieve what I
explained in my original posting -->
<xsl:template name="tag-text">
<xsl:param name="unparsed" required="yes"/>
<xsl:analyze-string select="$unparsed"
regex="^(.*?)&lt;(.+)&gt;(.*)&lt;/(.+)&gt;(.*?)$">
etc as posted earlier.
The skeleton input will be like (as I mentioned before):
Line one text <b>within valid node</b> and like <II .> Title etc
Line two with <1a .> Title etc, <i>within</i> <b>something</b> etc
another line can be just normal text
....
It is vital that I get the data in the form I want here, so that our final
output in stage two is correct. In view of this, I cannot use <value-of
select with d-o-e> here. As it seems this cannot be achieved in XSL (which
looks likely), I am trying to get my source corrected. But if there are
solutions available, in XSL or with a better regex, I would be happy to use
them. I hope the above clarifies your question.
Thanks,
Karl