Re: Generating numeric character references
2003-01-14 14:55:08
Stuart,
The reason your task is proving difficult is that it's really not what it
appears to be at first blush. There is a trap here, which you can recognize
if you can clearly distinguish between XML-as-serialization format, and the
XML document (a tree of nodes as described in the XPath spec) that an XSLT
processor operates on.
Numeric character references may appear in XML-as-serialization; in the
XPath tree (the "document" built by the parser and handed to the XSLT
engine), however, these references never appear as such; rather, each has
been converted into the character it represents.
So, for example, if your data has the character reference &#x41;, your XSLT
processor sees this as an "A". (It may come out the back as "&#x41;" if
your serialization encoding happens not to be able to do a proper "A", but
internally it's an "A"). Therefore, what's required with "&amp;#x41;" isn't
to turn it into "&#x41;", but rather into "A". (Or, if you get my drift:
you need to convert "&amp;#x41;" into "&#x41;" *before* your document is
parsed, or an "&amp;#x41;" into an "A" *after* your document is parsed.)
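To make the distinction concrete, here is a minimal sketch. The input
document and its element names (doc, good, bad) are hypothetical, invented
only for illustration. The stylesheet reports the length of each element's
string value as the XSLT processor sees it:

```xml
<!-- Hypothetical input, for illustration only:
     <doc>
       <good>&#x41;</good>
       <bad>&amp;#x41;</bad>
     </doc>
-->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Visit each child element of the (assumed) doc root -->
    <xsl:for-each select="doc/*">
      <xsl:value-of select="name()"/>
      <xsl:text>: </xsl:text>
      <!-- Length of the text as it exists in the tree, after parsing -->
      <xsl:value-of select="string-length(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

For that input, "good" should report 1 (the parser has already turned the
reference into the single character "A"), while "bad" should report 6 (the
escaped ampersand leaves the six literal characters & # x 4 1 ; in the
text node).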
You are currently trying to do the latter; and it can be done -- as you're
discovering -- with recursive processing over text nodes, heuristics to
recognize target substrings, and a table to map them. But it's not a job
that XSLT lends itself to, since XSLT is as ungainly for processing
strings as it is slick for processing nodes. Far preferable would be to use
Perl or something else with good support for string-handling and regular
expressions, to do the former task (munge the &amp; entities before parsing).
Yet -- and this is where one has to be *very* cautious -- XSLT does, at
least in certain circumstances (i.e. with certain processors in certain
operational contexts) give you *some* control over how a document, once
processed, is serialized -- and *if your data is clean* this optional
feature can be drafted into service to help with your problem. What I'm
getting to, of course, is the dreaded disable-output-escaping....
That is, if your data is otherwise unproblematic, you can achieve your goal
by running your document through a near-identity transform that disables
output escaping on your text nodes. The document will emerge from the
transform unchanged (at least as XPath sees it) but with "&amp;#x41;"
represented as "&#x41;". This, *when parsed again*, will be seen as the "A"
you really want.
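A minimal sketch of such a near-identity transform follows, assuming an
XSLT 1.0 processor that honors disable-output-escaping and that actually
serializes its result (the attribute has no effect when the result tree is
handed on as a tree, e.g. to a DOM). The explicit priority on the text
template is my own precaution, so that it cannot conflict with node() in
the identity template:

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy elements, attributes, comments and
       processing instructions through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text nodes: write the string value with escaping switched off,
       so a literal "&" is serialized as "&" rather than "&amp;".
       This applies to EVERY "&" and "<" in the text, wanted or not. -->
  <xsl:template match="text()" priority="1">
    <xsl:value-of select="." disable-output-escaping="yes"/>
  </xsl:template>

</xsl:stylesheet>
```

Note that this touches only text nodes; damaged references inside
attribute values are not covered by it, since disable-output-escaping
applies only to text nodes in the result.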
Note that this is not (if we're really strict with our terms) a
transformation in the XSLT sense. Rather, it's a tricky application of the
serializer attached to most processors. It will sometimes break, because it
disables escaping on every character in the text, not just the ones you
want (so if you have any data such as "if x < y", you're going to be in
trouble unless you write string-processing code to catch and work around
it), and it uses an optional feature of the language, which restricts
portability.
Please consider this only a golden-hammer solution (i.e. lacking a better
tool to do the job), and keep in mind it's easy to bang your thumb this way
(since any anomalies in the input will make your output not well-formed).
It is in view of these limitations that this really should be done in a
separate pass, if with XSLT at all.
Cheers,
Wendell
At 03:05 PM 1/14/2003, you wrote:
I'd like to transform specific text substrings into numeric character
references during an XSLT transformation. For example, I want to
transform all occurrences that look like "&amp;#173;" within a string
into "&#173;".
Here's the back story. I have source XML that is generated automatically
from HTML by a third party. The third party incorrectly handles entity
references, so that "&#173;" in the original HTML becomes
"&amp;#173;" in the XML. I want to undo the damage done. To simplify
things, I am only interested in documents with ISO 8859-1 encoding.
Below is a solution [1] that I am not pleased with. It is a named
template that recursively parses a string, trying to replace references.
This requires an <xsl:when> element for each value of numeric character
reference that might be encountered (see the "additional cases here"
comment). Problems with this include linear search of values, omitted
values, and opportunity for error in mismatched values.
Can anyone suggest a better approach to generating numeric character
references? I would be fine restricting the solution to MSXML or
.NET's System.Xml.Xsl XSLT processors, if that is an issue.
Many thanks!
Cheers,
Stuart
[1] A less than happy solution:
<xsl:template name="restoreNumCharRefs">
  <xsl:param name="string"/>
  <xsl:choose>
    <!-- A literal ampersand may start a damaged reference such as "&#173;" -->
    <xsl:when test="contains($string, '&amp;')">
      <xsl:variable name="head" select="substring-before($string, '&amp;')"/>
      <xsl:variable name="remainder" select="substring-after($string, '&amp;')"/>
      <xsl:variable name="reference" select="substring-before($remainder, ';')"/>
      <xsl:variable name="entity">
        <xsl:choose>
          <xsl:when test="$reference='#167'">&#167;</xsl:when>
          <xsl:when test="$reference='#173'">&#173;</xsl:when>
          <!-- additional cases here -->
          <xsl:otherwise>&amp;<xsl:value-of select="$reference"/>;</xsl:otherwise>
        </xsl:choose>
      </xsl:variable>
      <!-- Recurse on whatever follows the terminating semicolon -->
      <xsl:variable name="tail">
        <xsl:call-template name="restoreNumCharRefs">
          <xsl:with-param name="string" select="substring-after($remainder, ';')"/>
        </xsl:call-template>
      </xsl:variable>
      <xsl:value-of select="concat($head, $entity, $tail)"/>
    </xsl:when>
    <!-- No ampersand left: emit the string unchanged -->
    <xsl:otherwise>
      <xsl:value-of select="$string"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
======================================================================
Wendell Piez
mailto:wapiez(_at_)mulberrytech(_dot_)com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================