xsl-list
RE: Generating numeric character references

2003-01-16 02:44:37

I think the original poster had a problem of double escaping, such as

&amp;#173;

in their source, and they simply wanted to convert this to the correct
&#173;

Wouldn't running the source XML through an identity transform give the
desired result, with no need for string processing of any kind?

cheers
andrew


-----Original Message-----
From: Wendell Piez [mailto:wapiez(_at_)mulberrytech(_dot_)com]
Sent: 14 January 2003 21:55
To: xsl-list(_at_)lists(_dot_)mulberrytech(_dot_)com
Subject: Re: [xsl] Generating numeric character references


Stuart,

The reason your task is proving difficult is that it's really not what
it appears to be at first blush. There is a trap here, which you can
recognize if you can clearly distinguish between XML-as-serialization
format, and the XML document (a tree of nodes as described in the XPath
spec) that an XSLT processor operates on.

Numeric character references may appear in XML-as-serialization; in the
XPath tree (the "document" built by the parser and handed to the XSLT
engine), however, these references never appear as such; rather, each
has been converted into the character it represents.

So, for example, if your data has character reference &#x41;, your XSLT
processor sees this as an "A". (It may come out the back as "&#x41;" if
your serialization encoding happens not to be able to do a proper "A",
but internally it's an "A"). Therefore, what's required with
"&amp;#x41;" isn't to turn it into "&#x41;", but rather into "A". (Or,
if you get my drift: you need to convert "&amp;#x41;" into "&#x41;"
*before* your document is parsed, or an "&#x41;" into an "A" *after*
your document is parsed.)
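
The serialization-versus-tree distinction can be checked outside XSLT. What follows is only a minimal sketch using Python's standard xml.etree.ElementTree (chosen here for illustration, not something used in this thread); the element name `p` is made up.

```python
import xml.etree.ElementTree as ET

# A genuine character reference: the parser resolves it to the character.
good = ET.fromstring("<p>&#x41;</p>")

# A double-escaped reference: the parser resolves only the &amp;, leaving
# the six-character string "&#x41;" in the tree.
bad = ET.fromstring("<p>&amp;#x41;</p>")

print(repr(good.text))   # 'A'
print(repr(bad.text))    # '&#x41;'

# Serializing the tree again escapes the ampersand, so the double-escaped
# form comes back out exactly as it went in.
round_trip = ET.tostring(bad, encoding="unicode")
print(round_trip)        # <p>&amp;#x41;</p>
```

The last step suggests why a plain parse-and-reserialize pass, on its own, leaves the double escaping in place.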

You are currently trying to do the latter; and it can be done -- as
you're discovering -- with recursive processing over text nodes,
heuristics to recognize target substrings, and a table to map them. But
it's not a job that XSLT lends itself to, since XSLT is as ungainly for
processing strings as it is slick for processing nodes. Far preferable
would be to use Perl or something else with good support for
string-handling and regular expressions, to do the former task (munge
the &amp; entities before parsing).
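
As a rough illustration of the "former task", here is a sketch of a pre-parse fix using a regular expression, written in Python rather than Perl. The function name `restore_references` and the sample input are hypothetical, and the pattern assumes the only damage is `&amp;` in front of an otherwise valid numeric reference.

```python
import re
import xml.etree.ElementTree as ET

# Matches a double-escaped numeric reference such as &amp;#173; or &amp;#xAD;
DOUBLE_ESCAPED = re.compile(r"&amp;(#[0-9]+|#x[0-9A-Fa-f]+);")

def restore_references(raw_xml: str) -> str:
    """Turn &amp;#NNN; back into &#NNN; in unparsed XML text."""
    return DOUBLE_ESCAPED.sub(r"&\1;", raw_xml)

raw = "<p>soft&amp;#173;hyphen, AT&amp;T, x &lt; y</p>"
fixed = restore_references(raw)
print(fixed)   # <p>soft&#173;hyphen, AT&amp;T, x &lt; y</p>

# Only after the fix does the parser see a real character reference.
text = ET.fromstring(fixed).text
print(repr(text))
```

Because the pattern only touches `&amp;` when it is followed by a numeric reference, other escaped ampersands and `&lt;` are left alone, so the result stays well-formed. Text that is meant to display a literal "&amp;#173;" would also be rewritten, so this is only safe when no such text is expected.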

Yet -- and this is where one has to be *very* cautious -- XSLT does, at
least in certain circumstances (i.e. with certain processors in certain
operational contexts) give you *some* control over how a document, once
processed, is serialized -- and *if your data is clean* this optional
feature can be drafted into service to help with your problem. What I'm
getting to, of course, is the dreaded disable-output-escaping....

That is, if your data is otherwise unproblematic, you can achieve your
goal by running your document through a near-identity transform that
disables output escaping on your text nodes. The document will emerge
from the transform unchanged (at least as XPath sees it) but with
"&amp;#x41;" represented as "&#x41;". This, *when parsed again*, will be
seen as the "A" you really want.

Note that this is not (if we're really strict with our terms) a
transformation in the XSLT sense. Rather, it's a tricky application of
the serializer attached to most processors. It will sometimes break
because it disables escaping on the wrong characters (so if you have any
data such as "if x < y", you're going to be in trouble unless you write
string-processing code to catch and work around it), and it uses an
optional feature of the language that restricts portability.
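
The failure mode can be imitated without an XSLT processor. The sketch below is not disable-output-escaping itself; it is a Python stand-in that writes text nodes with no escaping, which is roughly what that feature does to the serializer. The helper names `serialize_unescaped` and `second_pass` are hypothetical, and attributes are ignored for brevity.

```python
import xml.etree.ElementTree as ET

def serialize_unescaped(elem: ET.Element) -> str:
    """Write an element with its text emitted raw (no escaping)."""
    parts = [f"<{elem.tag}>", elem.text or ""]
    for child in elem:
        parts.append(serialize_unescaped(child))
        parts.append(child.tail or "")
    parts.append(f"</{elem.tag}>")
    return "".join(parts)

def second_pass(raw_xml: str) -> str:
    """Parse, then serialize without escaping any text."""
    return serialize_unescaped(ET.fromstring(raw_xml))

# Clean data: the double-escaped reference comes out as a real reference.
clean = second_pass("<p>soft&amp;#173;hyphen</p>")
print(clean)                          # <p>soft&#173;hyphen</p>
reparsed = ET.fromstring(clean).text  # now contains a real U+00AD

# Data with an escaped "<": the unescaped output is no longer well-formed.
dirty = second_pass("<p>if x &lt; y then &amp;#173;</p>")
print(dirty)                          # <p>if x < y then &#173;</p>
try:
    ET.fromstring(dirty)
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)                    # False
```

The second input shows the thumb-banging case: a single escaped "<" in the data is enough to make the output unparseable.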

Please consider this only a golden-hammer solution (i.e. lacking a
better tool to do the job), and keep in mind it's easy to bang your
thumb this way (since any anomalies in the input will make your output
not well-formed). It is in view of these limitations that this really
should be done in a separate pass, if with XSLT at all.

Cheers,
Wendell

  At 03:05 PM 1/14/2003, you wrote:
I'd like to transform specific text substrings into numeric character
references during an XSLT transformation. For example, I want to
transform all occurrences that look like "&amp;#173;" within a string
into "&#173;".

Here's the back story. I have source XML that is generated automatically
from HTML by a third party. The third party incorrectly handles entity
references, so that "&#173;" in the original HTML becomes "&amp;#173;"
in the XML. I want to repair the damage done. To simplify things, I am
only interested in documents with ISO 8859-1 encoding.

Below is a solution [1] that I am not pleased with. It is a named
template that recursively parses a string, trying to replace references.
This requires an <xsl:when> element for each value of numeric character
reference that might be encountered (see the "additional cases here"
comment). Problems with this include linear search of values, omitted
values, and opportunity for error in mismatched values.

Can anyone suggest a better approach to generating numeric character
references? I would be fine restricting the solution to MSXML or
.NET's System.Xml.Xsl XSLT processors, if that is an issue.

Many thanks!

Cheers,
Stuart



[1] A less than happy solution:

  <xsl:template name="restoreNumCharRefs">
    <xsl:param name="string"/>

    <xsl:choose>
      <xsl:when test="contains($string, '&amp;')">
        <!-- Split at the first ampersand and at the semicolon after it -->
        <xsl:variable name="head"
                      select="substring-before($string, '&amp;')"/>
        <xsl:variable name="remainder"
                      select="substring-after($string, '&amp;')"/>
        <xsl:variable name="reference"
                      select="substring-before($remainder, ';')"/>

        <!-- Map each known reference to the character it stands for -->
        <xsl:variable name="entity">
          <xsl:choose>
            <xsl:when test="$reference='#167'">&#167;</xsl:when>
            <xsl:when test="$reference='#173'">&#173;</xsl:when>

            <!-- additional cases here -->

            <xsl:otherwise>&amp;<xsl:value-of
                select="$reference"/>;</xsl:otherwise>
          </xsl:choose>
        </xsl:variable>

        <!-- Recurse on whatever follows the semicolon -->
        <xsl:variable name="tail">
          <xsl:call-template name="restoreNumCharRefs">
            <xsl:with-param name="string"
                            select="substring-after($remainder, ';')"/>
          </xsl:call-template>
        </xsl:variable>

        <xsl:value-of select="concat($head, $entity, $tail)"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$string"/>
      </xsl:otherwise>
    </xsl:choose>

  </xsl:template>
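
For comparison, the same post-parse string replacement needs no per-value table in a language with regular expressions, since the code point can be computed from the digits. This is a minimal sketch in Python, not a drop-in replacement for the template; `decode_numeric_refs` is a hypothetical name, and anything that is not a numeric reference is passed through untouched, as in the template's <xsl:otherwise> branch.

```python
import re

# Decimal (&#173;) or hexadecimal (&#xAD;) numeric character reference.
NUMERIC_REF = re.compile(r"&#(?:([0-9]+)|x([0-9A-Fa-f]+));")

def decode_numeric_refs(s: str) -> str:
    """Replace each numeric character reference in s with its character."""
    def replace(match):
        dec, hexa = match.groups()
        code_point = int(dec) if dec is not None else int(hexa, 16)
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the text as it was.
            return match.group(0)
    return NUMERIC_REF.sub(replace, s)

decoded = decode_numeric_refs("a&#167;b&#173;c &nbsp; &#xAD;")
print(repr(decoded))
```

Here the input string stands for the value of a text node after parsing, that is, one that still contains the literal characters "&#173;".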


 XSL-List info and archive:  
http://www.mulberrytech.com/xsl/xsl-list


======================================================================
Wendell Piez                            
mailto:wapiez(_at_)mulberrytech(_dot_)com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
   Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================

