xsl-list

Re: Switching off character entity resolution in XSL

2004-02-03 05:23:17
On Tue, 2004-02-03 at 03:11, AHynes(_at_)cch(_dot_)com(_dot_)au wrote:
> Hello All,
>
> Unlike what most people would use XSL for (i.e. conversion of XML to HTML
> or another output format), I have a requirement to transform from one XML
> structure to another (subsequent presentation rendering occurring way
> downstream). No big deal I guess, but the annoying thing here is that by
> the time an XML parser has done its job as per the XML specification, all
> those pesky character entities have been resolved (as defined in the DTD
> for the source document) and the output contains square brackets.
>
> Example:
> source document contains:     &bull;
> After transformation:         [bull  ]    (of course, the entity declared
> in the DTD is this, i.e. <!ENTITY bull "[bull  ]">)
> What I would like:            &bull;

This looks like it's either an old DTD converted from SGML unedited,
or a DTD written by someone who was unaware that XML shouldn't need 
to use character entities. In practice there are always reasons: an
editor which cannot generate all the required characters is one
common problem.

> I really don't want to go messing with the DTD either, and I really don't
> think a parser would like there being unparsed entities within an entity
> declaration in a DTD, i.e. <!ENTITY bull &bull;> is illegal.

So, alas, is a recursive reference like <!ENTITY bull "&#38;bull;">,
at least in Saxon and I assume in other processors as well.

> I realise there is some way of dealing with this with character
> substitutions before or after using something like sed, but this isn't
> really a great solution, particularly across platforms. Is there any way of
> manipulating the output using XSL, or alternatively switching off entity
> resolution in the parser?

I don't think so, but you can add to the internal subset a 
declaration of the character entities you want output as something 
else, eg

<?xml version="1.0"?>
<!DOCTYPE whatever SYSTEM "some.dtd" [
<!ENTITY bull "&#x2022;">
]>

This will output a "real" bullet, written either as the character
itself or as a numeric character reference depending on the output
encoding.
If you have copies of the character entity declaration files (eg
from the distribution of DocBook) you could reference them in the
internal subset instead, so that all the declarations override
any in the DTD.
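
As a rough sketch of that second option, assuming the ISO publishing
entity set (the one that declares bull) has been copied to a local
file, the internal subset could pull it in through a parameter entity.
The path ent/isopub.ent and the parameter entity name isopub are
placeholders for illustration, not taken from your setup.

```xml
<?xml version="1.0"?>
<!DOCTYPE whatever SYSTEM "some.dtd" [
<!-- Hypothetical local copy of the ISO publishing entity set; adjust the path -->
<!ENTITY % isopub SYSTEM "ent/isopub.ent">
<!-- Declarations pulled in here are read before those in some.dtd -->
%isopub;
]>
```

This works because the first declaration of an entity is the binding
one, and the internal subset is processed before the external DTD, so
the DTD's own "[bull  ]" declaration is ignored.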

Is there a reason why your output should need to preserve the
character entity format?
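
If the underlying concern is only that non-ASCII characters must not
appear as raw bytes in the output, one possible approach is to let the
serializer do the escaping. A minimal sketch, assuming an XSLT 1.0
processor and a plain identity transform standing in for your real
stylesheet: with an ASCII output encoding, a character such as U+2022
should be written as a numeric character reference rather than as a
literal character.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- US-ASCII cannot represent U+2022, so the serializer has to
       emit it as a character reference -->
  <xsl:output method="xml" encoding="US-ASCII"/>

  <!-- Identity template: copy every node and attribute unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Whether the reference comes out in decimal or hexadecimal form is up
to the processor, and this does not restore the named form &bull;.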

///Peter



 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list