RE: Generating numeric character references
2003-01-16 02:57:44
Hi,
This won't work.
If you took the results of this transform and gave them to a SAX
ContentHandler, you'd get a 'characters' call with the string
"&#173;", not with the single character represented by U+00AD.
Also, if you re-serialised the result, you'd end up back where
you started: &amp;#173;
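The point about the 'characters' callback can be checked with a small sketch. It uses Python's standard xml.sax module as a stand-in for any SAX parser; the class name TextCollector and the helper sax_text are made up for this illustration and are not part of the thread.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects everything delivered through characters() callbacks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # A parser is free to deliver one text node in several chunks,
        # so the pieces are joined afterwards.
        self.chunks.append(content)


def sax_text(xml_source):
    """Return the concatenated character data of a small XML string."""
    handler = TextCollector()
    xml.sax.parseString(xml_source.encode("utf-8"), handler)
    return "".join(handler.chunks)


# What an identity transform emits for double-escaped input.
double_escaped = sax_text("<p>&amp;#173;</p>")
# What the original poster actually wants in the serialized document.
single_escaped = sax_text("<p>&#173;</p>")

print(repr(double_escaped))  # the literal six-character string
print(repr(single_escaped))  # the single character U+00AD
```

The first document yields the literal text of a reference, the second yields the soft hyphen itself, which is the difference the identity transform cannot bridge.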
Dan.
--
Danny Yates
Technical Architect
Abbey National Treasury Services
E-mail: Danny(_dot_)Yates(_at_)ants(_dot_)co(_dot_)uk
Phone: +44 20 7756 5012
Fax: +44 20 7612 4342
-----Original Message-----
From: Andrew Welch [mailto:AWelch(_at_)piper-group(_dot_)com]
Sent: 16 January 2003 09:45
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: RE: [xsl] Generating numeric character references
I think the original poster had a problem of double escaping, such as
&amp;#173;
in their source, and they simply wanted to convert this to the correct
&#173;
Wouldn't running the source XML through an identity transform give
the desired result, with no need for string processing of any kind?
cheers
andrew
-----Original Message-----
From: Wendell Piez [mailto:wapiez(_at_)mulberrytech(_dot_)com]
Sent: 14 January 2003 21:55
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] Generating numeric character references
Stuart,
The reason your task is proving difficult is that it's really not what it
appears to be at first blush. There is a trap here, which you can recognize
if you can clearly distinguish between XML-as-serialization format, and the
XML document (a tree of nodes as described in the XPath spec) that an XSLT
processor operates on.
Numeric character references may appear in XML-as-serialization; in the
XPath tree (the "document" built by the parser and handed to the XSLT
engine), however, these references never appear as such; rather, each has
been converted into the character it represents.
So, for example, if your data has character reference &#x41;, your XSLT
processor sees this as an "A". (It may come out the back as "&#x41;" if
your serialization encoding happens not to be able to do a proper "A", but
internally it's an "A"). Therefore, what's required with "&amp;#x41;" isn't
to turn it into "&#x41;", but rather into "A". (Or, if you get my drift:
you need to convert "&amp;#x41;" into "&#x41;" *before* your document is
parsed, or an "&#x41;" into an "A" *after* your document is parsed.)
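The distinction between the serialized form and the parsed tree can be seen in a few lines. This is only a sketch using Python's standard xml.etree.ElementTree in place of an XSLT processor's parser; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# A genuine character reference is resolved by the parser:
# the tree holds the character, not the reference.
plain = ET.fromstring("<t>&#x41;</t>").text

# A double-escaped reference survives parsing as literal text.
double = ET.fromstring("<t>&amp;#x41;</t>").text

# Serializing that tree again escapes the ampersand, so the output
# is identical to the input. Nothing has been repaired.
reserialized = ET.tostring(ET.fromstring("<t>&amp;#x41;</t>"), encoding="unicode")

print(repr(plain))
print(repr(double))
print(reserialized)
```

The last value also shows why a plain parse-and-serialize round trip leaves double-escaped data exactly as it was.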
You are currently trying to do the latter; and it can be done -- as you're
discovering -- with recursive processing over text nodes, heuristics to
recognize target substrings, and a table to map them. But it's not a job
that XSLT lends itself towards, since XSLT is as ungainly for processing
strings as it is slick for processing nodes. Far preferable would be to use
Perl or something else with good support for string-handling and regular
expressions, to do the former task (munge the &amp; entities before parsing).
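A pre-parse pass of the kind suggested here might look like the following. It is a minimal sketch in Python rather than Perl, and it assumes the double-escaped references only occur in ordinary text (not inside CDATA sections or comments) and always name characters that are legal in XML. The names DOUBLE_ESCAPED_REF and unescape_numeric_refs are invented for this example.

```python
import re
import xml.etree.ElementTree as ET

# Matches "&amp;#173;" or "&amp;#xAD;" in the raw, unparsed text.
# Anything else that starts with "&amp;" is deliberately left alone.
DOUBLE_ESCAPED_REF = re.compile(r"&amp;(#(?:[0-9]+|x[0-9A-Fa-f]+);)")


def unescape_numeric_refs(raw_xml):
    """Turn '&amp;#NNN;' into '&#NNN;' in serialized XML, before parsing."""
    return DOUBLE_ESCAPED_REF.sub(r"&\1", raw_xml)


raw = "<p>soft&amp;#173;hyphen, R&amp;D, x &lt; y</p>"
fixed = unescape_numeric_refs(raw)

# After the fix-up the parser resolves the reference as usual,
# while the other escapes are still handled correctly.
text = ET.fromstring(fixed).text

print(fixed)
print(repr(text))
```

Because the substitution runs on the serialized text, the legitimate "&amp;" in "R&amp;D" and the "&lt;" are untouched and the result stays well-formed.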
Yet -- and this is where one has to be *very* cautious -- XSLT does, at
least in certain circumstances (i.e. with certain processors in certain
operational contexts) give you *some* control over how a document, once
processed, is serialized -- and *if your data is clean* this optional
feature can be drafted into service to help with your problem. What I'm
getting to, of course, is the dreaded disable-output-escaping....
That is, if your data is otherwise unproblematic, you can achieve your goal
by running your document through a near-identity transform that disables
output escaping on your text nodes. The document will emerge from the
transform unchanged (at least as XPath sees it) but with "&amp;#x41;"
represented as "&#x41;". This, *when parsed again*, will be seen as the "A"
you really want.
Note that this is not (if we're really strict with our terms) a
transformation in the XSLT sense. Rather, it's a tricky application of the
serializer attached to most processors. It will sometimes break because it
disables escaping on the wrong characters (so if you have any data such as
"if x < y", you're going to be in trouble unless you write
string-processing code to catch and work around it), and it uses an optional
feature of the language that restricts portability.
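The hazard described above can be imitated outside XSLT. The sketch below is only an analogy: Python has no disable-output-escaping, so a hypothetical serialize function either escapes text the way a serializer normally would or writes it raw, and round_trip then re-parses the result. All names here are invented for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def serialize(text, escaping=True):
    """Wrap text in a <p> element, optionally without escaping it."""
    body = escape(text) if escaping else text
    return "<p>" + body + "</p>"


def round_trip(text, escaping):
    """Serialize, then parse again, and return the resulting text node."""
    return ET.fromstring(serialize(text, escaping)).text


clean = "soft&#173;hyphen"

# Normal escaping: the literal reference text comes back unchanged.
escaped_result = round_trip(clean, escaping=True)

# Escaping disabled: the second parse resolves the reference.
unescaped_result = round_trip(clean, escaping=False)

# The same trick on data containing '<' produces output that is
# not well-formed, so the second parse fails.
try:
    round_trip("if x < y", escaping=False)
    dirty_parses = True
except ET.ParseError:
    dirty_parses = False

print(repr(escaped_result), repr(unescaped_result), dirty_parses)
```

Clean data comes through as intended, while a single stray "<" is enough to break the second parse, which is why this approach is only safe when the input is known to be clean.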
Please consider this only a golden-hammer solution (i.e. lacking a better
tool to do the job), and keep in mind it's easy to bang your thumb this way
(since any anomalies in the input will make your output not well-formed).
It is in view of these limitations that this really should be done in a
separate pass, if with XSLT at all.
Cheers,
Wendell
At 03:05 PM 1/14/2003, you wrote:
I'd like to transform specific text substrings into numeric character
references during an XSLT transformation. For example, I want to
transform all occurrences that look like "&amp;#173;" within a string
into "&#173;".
Here's the back story. I have source XML that is generated automatically
from HTML by a third party. The third party incorrectly handles entity
references, so that "&#173;" in the original HTML becomes
"&amp;#173;" in the XML. I want to repair the damage done. To simplify
things, I am only interested in documents with ISO 8859-1 encoding.
Below is a solution [1] that I am not pleased with. It is a named
template that recursively parses a string, trying to replace references.
This requires an <xsl:when> element for each value of numeric character
reference that might be encountered (see the "additional cases here"
comment). Problems with this include linear search of values, omitted
values, and opportunity for error in mismatched values.
Can anyone suggest a better approach to generating numeric character
references? I would be fine restricting the solution to MSXML or
.NET's System.Xml.Xsl XSLT processors, if that is an issue.
Many thanks!
Cheers,
Stuart
[1] A less than happy solution:
<xsl:template name="restoreNumCharRefs">
  <xsl:param name="string"/>
  <xsl:choose>
    <!-- Only treat '&' as the start of a reference when a ';' follows;
         otherwise the rest of the string would be silently dropped. -->
    <xsl:when test="contains(substring-after($string, '&amp;'), ';')">
      <xsl:variable name="head" select="substring-before($string, '&amp;')"/>
      <xsl:variable name="remainder" select="substring-after($string, '&amp;')"/>
      <xsl:variable name="reference" select="substring-before($remainder, ';')"/>
      <!-- Map each known reference to the character it stands for. -->
      <xsl:variable name="entity">
        <xsl:choose>
          <xsl:when test="$reference='#167'">&#167;</xsl:when>
          <xsl:when test="$reference='#173'">&#173;</xsl:when>
          <!-- additional cases here -->
          <!-- Unknown reference: emit it unchanged. -->
          <xsl:otherwise>&amp;<xsl:value-of select="$reference"/>;</xsl:otherwise>
        </xsl:choose>
      </xsl:variable>
      <!-- Recurse on whatever follows the ';'. -->
      <xsl:variable name="tail">
        <xsl:call-template name="restoreNumCharRefs">
          <xsl:with-param name="string" select="substring-after($remainder, ';')"/>
        </xsl:call-template>
      </xsl:variable>
      <xsl:value-of select="concat($head, $entity, $tail)"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$string"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
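For comparison, the same post-parse replacement can be written without a per-value table when the host language can compute a character from its code point. The following is a sketch in Python, not XSLT, and not something posted in the thread; NUMERIC_REF and restore_num_char_refs are names chosen for the example. It does not check that the code point is a legal XML character.

```python
import re

# Decimal (&#173;) or hexadecimal (&#xAD;) reference spelled out as text.
NUMERIC_REF = re.compile(r"&#(?:([0-9]+)|x([0-9A-Fa-f]+));")


def restore_num_char_refs(text):
    """Replace literal '&#NNN;' substrings with the characters they name."""

    def to_char(match):
        decimal, hexadecimal = match.groups()
        code = int(decimal) if decimal else int(hexadecimal, 16)
        # Values beyond the Unicode range are left as they were found.
        if code > 0x10FFFF:
            return match.group(0)
        return chr(code)

    return NUMERIC_REF.sub(to_char, text)


sample = "a&#167;b&#173;c&#xA7;"
restored = restore_num_char_refs(sample)
print(repr(restored))
```

Computing the character from the number removes the three problems listed for the template: there is no linear search, no value can be omitted, and there is no table in which a number and its character could be mismatched.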
XSL-List info and archive:
http://www.mulberrytech.com/xsl/xsl-list
======================================================================
Wendell Piez
mailto:wapiez(_at_)mulberrytech(_dot_)com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================