RE: Generating numeric character references
2003-01-16 10:23:45
Andy,
At 04:44 AM 1/16/2003, you wrote:
I think the original poster had a problem of double escaping, such as
& a m p ; # 1 7 3 ;
in their source, and they simply wanted to convert this to the correct & #
1 7 3 ;
Thanks for spacing for legibility. I didn't do that (and I wonder now if it
made my post unintelligible to anyone -- sorry).
Wouldn't running the source xml through an identity transform give
the desired result, no need for string processing of any kind.....
Well, not exactly. There's a problem here between "reality" and
"representation". In order to sidestep the metaphysical problem here of which is
which (a problem which is not negligible, indeed is at the heart of a deep
contention respecting appropriate design strategies for the XML family of
specs), I'll call them "external" and "internal". "External" means
XML-as-serialized; it's also the way you write a stylesheet (which is,
after all, XML serialized). "Internal" means the XPath tree once the parser
has done its job and handed the structured data to the processor.
What Stuart wants is to move from "Before" to "After":
            external         internal
  Before    & amp; #x41;     & #x41;
  After     & #x41;          A
A straight identity transform would:
1. Parse "external" into "internal": & amp; #x41; becomes & #x41;
2. Copy the input tree to the output tree (the identity transform itself)
3. Serialize the output tree: internal is expressed as external, and & #x41;
   becomes & amp; #x41;
So the straight identity transform keeps "Before" as "Before" (as it
should, being an identity transform).
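The same round trip can be seen with any conforming parser and serializer.
Here is a minimal sketch in Python, with the standard library's
xml.etree.ElementTree standing in for the XSLT processor (that substitution,
and the sample element name "p", are my assumptions for illustration; the
strings are written without the safety spaces used in this message):

```python
import xml.etree.ElementTree as ET

# External "Before": the reference is double-escaped in the serialized XML.
before_external = "<p>&amp;#x41;</p>"

# Step 1: parsing turns external into internal; "&amp;" becomes a plain "&".
root = ET.fromstring(before_external)
before_internal = root.text

# Step 2: an identity transform copies the tree unchanged, so nothing to do.

# Step 3: serializing turns internal back into external, re-escaping the "&".
round_tripped = ET.tostring(root, encoding="unicode")

print(before_internal)   # the internal string, with a bare ampersand
print(round_tripped)     # identical to before_external
```

The serializer escapes the ampersand on the way out, so the output is the
same "Before" we started with.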
But Stuart doesn't want Before; he wants After. While this may seem like it
ought to be trivial (internally, what we have Before is exactly what we
want externally After), it's not, since we have to get across the
architectural boundary (Mike K's phrase) between the serialized XML and the
parsed XML-as-XPath-tree. If parsers and serializers are doing their jobs
properly, they shouldn't allow this -- an internal "& #x41;" should always
serialize as "& amp; #x41;", no exceptions (please elide the safety spaces
here: I just hate e-mail clients that parse plain text!).
Tom P's suggestion is to pre-process, observing that the simplest and
cleanest approach is to run a routine over the external form of the XML to
turn Before into After, and not to worry about the parser (not to worry
about what's "internal") until he's got the data the way he wants it.
Architecturally, this is a good solution (it maintains the boundary), and
it'll be speedy since he'll use a tool (I think Tom would use Python ;-)
well-suited for string-munging without XML parsing.
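A pre-process of that kind might look something like the sketch below. It is
only my guess at the shape of such a routine, not Tom's code: the function
name and sample data are hypothetical, and it assumes every "& amp;" directly
followed by a numeric-reference body is a mistake (it does not look inside
CDATA sections or comments, where such a sequence might be intentional).

```python
import re
import xml.etree.ElementTree as ET

# "&amp;" immediately followed by the body of a numeric character reference:
# "#" plus decimal digits, or "#x" plus hex digits, then ";".
DOUBLE_ESCAPED_REF = re.compile(r"&amp;(#(?:[0-9]+|x[0-9A-Fa-f]+);)")

def unescape_numeric_refs(serialized_xml: str) -> str:
    """Rewrite double-escaped numeric references in the serialized text."""
    return DOUBLE_ESCAPED_REF.sub(r"&\1", serialized_xml)

before = "<p>&amp;#x41; &amp; &amp;#173;</p>"
after = unescape_numeric_refs(before)

# Only now do we involve a parser; an ordinary "&amp;" was left alone.
parsed_text = ET.fromstring(after).text
```

Since it works purely on the external form, the parser never sees the
double-escaped data at all.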
If Stuart must do this inside the XSLT processor, however, he has no choice
but to work on the internal form.
His first approach was to map occurrences of the string "& #x41;" (again no
space) into the correct character, "A" (and let the serializer do whatever
it wants with the result). Like Tom's approach, this is safe, since it
respects the boundary, but (as Stuart noted) its performance may be
questionable, and it's something of a pain to program (XSLT isn't as well
suited for string-processing as many other tools).
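To show what "mapping on the internal form" means, here is a rough Python
stand-in for the string processing one would otherwise write in XSLT (it is
not Stuart's stylesheet; the function names are hypothetical, and it only
handles text nodes, not attribute values):

```python
import re
import xml.etree.ElementTree as ET

# A numeric character reference that survived parsing as literal text.
LITERAL_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def ref_to_char(match):
    """Convert one matched literal reference into the character it names."""
    hex_digits, dec_digits = match.groups()
    code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
    return chr(code_point)

def map_literal_refs(root):
    """Replace literal references in every text node of the tree, in place."""
    for elem in root.iter():
        if elem.text:
            elem.text = LITERAL_REF.sub(ref_to_char, elem.text)
        if elem.tail:
            elem.tail = LITERAL_REF.sub(ref_to_char, elem.tail)

# Parsing yields the internal form, where the references are plain strings.
root = ET.fromstring("<p>&amp;#x41;<b>&amp;#173;</b> and &amp;#x42;</p>")
map_literal_refs(root)
```

After the mapping the tree holds the real characters, and the serializer is
free to write them however it likes.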
It's also -confusing- since while you are really changing "& #x41;" into
"A", it appears you are changing "& amp; #x41;" into "& #x41;", since of
course, *you see the external representation both in your source and result
files, and in your XSLT*.
My evil suggestion (and a more confusing one) was to commandeer his
serializer into writing "& #x41;" for the internal "& #x41;", rather than
the "& amp; #x41;" it ought to write. (On my diagram, this amounts to using
the serializer to jump diagonally instead of using an orthodox process to
move vertically.)
This may be an acceptable brute force method, sometimes. It will gain speed
over the internal-mapping approach. Unlike an external process
(pre-process), it happens within the XSLT architecture (or rather, across
it). It is fairly simple to program.
It *does* require that the data is otherwise sparkly-clean, or the wrong
characters will fail to get escaped on serializing, the "XML" will not be
well-formed coming out, and Stuart will be hosed, unable to parse his data
until he fixes it with non-XML string-munging tools (which is what he says
he's not allowed to do).
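The risk is easy to demonstrate with a toy model. The Python below is only a
simulation of a serializer that has been told not to escape (the function
names and sample strings are hypothetical, and no XSLT processor is
involved): text is written out as-is, and we then check whether the result
still parses.

```python
import xml.etree.ElementTree as ET

def serialize_without_escaping(internal_text: str) -> str:
    """Hypothetical stand-in for an unescaped write: text goes out as-is."""
    return "<p>" + internal_text + "</p>"

def is_well_formed(serialized_xml: str) -> bool:
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(serialized_xml)
        return True
    except ET.ParseError:
        return False

# Clean data: the only ampersands belong to the references we want.
clean = serialize_without_escaping("&#x41; and &#173;")
# Dirty data: a stray ampersand that should have been escaped.
dirty = serialize_without_escaping("AT&T sells &#x41;")

clean_ok = is_well_formed(clean)
dirty_ok = is_well_formed(dirty)
```

With clean data the output parses and the references resolve; one stray
ampersand is enough to make the whole result unparseable.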
Sorry for the long post, but it's a tricky topic and one that gets lots of
folks really stuck.
Cheers,
Wendell
======================================================================
Wendell Piez
mailto:wapiez(_at_)mulberrytech(_dot_)com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list