Since you appear to be using XSLT 2.0, it seems like character maps
would be the best solution XSLT has to offer. For your examples,
something like this might work (untested... YMMV):
<xsl:output method="xml" encoding="UTF-8" use-character-maps="PUAtoBMP"/>
<xsl:character-map name="PUAtoBMP">
<xsl:output-character character="a" string="x"/>
<xsl:output-character character="b" string="yz"/>
<xsl:output-character character="p" string="q&lt;sup&gt;r&lt;/sup&gt;"/>
</xsl:character-map>
An XSLT to transform your mapping file into a suitable character map
should be relatively straightforward.
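Here is a rough sketch of such a generator (equally untested). It assumes
every <original> holds exactly one character and that the markup inside
<unicode> is simple enough to write out by hand, since XSLT 2.0 has no
built-in serialize() function. The "out" prefix, its namespace URI, and the
"as-text" mode name are arbitrary choices of mine. Run it against your
mapping file; the result is a stylesheet containing the character map plus
an identity template.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:out="urn:example:xslt-alias">

  <!-- Literal result elements in the "out" namespace are written as xsl: elements. -->
  <xsl:namespace-alias stylesheet-prefix="out" result-prefix="xsl"/>
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/mappings">
    <out:stylesheet version="2.0">
      <out:output method="xml" encoding="UTF-8" use-character-maps="PUAtoBMP"/>
      <out:character-map name="PUAtoBMP">
        <!-- One-to-one, one-to-many and markup mappings are all handled alike. -->
        <xsl:apply-templates select=".//mapping"/>
      </out:character-map>
      <!-- Identity template, so the generated stylesheet copies its input. -->
      <out:template match="@* | node()">
        <out:copy>
          <out:apply-templates select="@* | node()"/>
        </out:copy>
      </out:template>
    </out:stylesheet>
  </xsl:template>

  <!-- Assumption: <original> contains exactly one character. -->
  <xsl:template match="mapping">
    <out:output-character character="{original}">
      <xsl:attribute name="string">
        <xsl:apply-templates select="unicode/node()" mode="as-text"/>
      </xsl:attribute>
    </out:output-character>
  </xsl:template>

  <!-- Write an element as a string of tags. Simplification: attribute values
       are not escaped and namespace declarations are not emitted. -->
  <xsl:template match="*" mode="as-text">
    <xsl:value-of select="concat('&lt;', name())"/>
    <xsl:for-each select="@*">
      <xsl:value-of select="concat(' ', name(), '=&quot;', ., '&quot;')"/>
    </xsl:for-each>
    <xsl:text>&gt;</xsl:text>
    <xsl:apply-templates mode="as-text"/>
    <xsl:value-of select="concat('&lt;/', name(), '&gt;')"/>
  </xsl:template>

</xsl:stylesheet>
```

Keep in mind that character maps only take effect when the result is
serialized, and that they apply to attribute values as well as text nodes.
The inserted <sup> tags become real markup only once the output file is
parsed again.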
-Brandon :)
On Sat, Apr 23, 2011 at 10:27 PM, Birnbaum, David J
<djbpitt(_at_)pitt(_dot_)edu> wrote:
Dear XSLT list,
I would be grateful for some advice about how to conceptualize a project that
involves remapping the textual characters in an XML document using XSLT. Here
are the details:
Input: XML with text nodes that are encoded using (or, rather, abusing) the
Unicode Private Use Area (PUA). The original content creators ignored the
entire existing Unicode inventory and mapped every text character to
something in the PUA. (They had their reasons, but they were misguided.
Damage done.) In most cases their individual PUA characters have individual
counterparts in the Unicode Base Multilingual Plane (BMP). In some cases,
though, what they encoded as an individual PUA character needs to be replaced
by more than one BMP character, and in other cases the replacement also has
to incorporate markup. See below for details.
Desired output: XML with the PUA text remapped to appropriate Unicode BMP
values, with any necessary markup inserted.
Mappings: There are at least three types of relationships (mappings) between
the PUA text in the original and the Unicode BMP needed in the output:
1. One to one. A single PUA character should be replaced by a single Unicode
BMP character.
2. One to many. A single PUA character should be replaced by two or more
Unicode BMP characters. No additional markup is inserted.
3. Markup mapping. One PUA character is remapped to one or more Unicode BMP
characters, but with inserted markup (see example below).
The mapping file that specifies what needs to be replaced by what looks like
the following:
<mappings>
<mapping>
<original>a</original>
<unicode>x</unicode>
</mapping>
<!-- more one-to-one mappings -->
<many>
<mapping>
<original>b</original>
<unicode>yz</unicode>
</mapping>
<!-- more one-to-many mappings -->
</many>
<markup>
<mapping>
<original>p</original>
<unicode>q<sup>r</sup></unicode>
</mapping>
<!-- more markup mappings -->
</markup>
</mappings>
Individual <mapping> elements directly under the root <mappings> element are
one-to-one. The one-to-many <mapping> elements are grouped under <many>,
which is under <mappings>. The mappings that insert markup are grouped under
<markup>, which is also under <mappings>.
Possible strategies:
1. One to one. Concatenate the values into strings and use them in
translate(), e.g.:
<xsl:variable name="originals"
select="string-join(doc('mappings.xml')/mappings/mapping/original, '')"/>
<xsl:variable name="replacements"
select="string-join(doc('mappings.xml')/mappings/mapping/unicode, '')"/>
and then, later, after doing the more complicated type-2 and type-3
replacements, pass the output of the last of those replacements to:
translate($text,$originals,$replacements)
2. One to many. Use replace() recursively, iterating over the one-to-many
mapping pairs, and feeding the output of the final replace() operation into
the translate() function above as the value of $text.
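Strategies 1 and 2 together might look roughly like the following sketch
(untested). The f: prefix, its namespace URI, and the function names are
placeholders invented for illustration. The two escaping helpers guard
against regex metacharacters in the pattern and against "$" or "\" in the
replacement string.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:remap"
    exclude-result-prefixes="xs f">

  <xsl:variable name="map" select="doc('mappings.xml')/mappings"/>
  <!-- One-to-one pairs, concatenated into two parallel strings for translate(). -->
  <xsl:variable name="originals" select="string-join($map/mapping/original, '')"/>
  <xsl:variable name="replacements" select="string-join($map/mapping/unicode, '')"/>

  <!-- Escape regex metacharacters so the original is matched literally. -->
  <xsl:function name="f:escape-pattern" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <xsl:sequence select="replace($s,
        '(\.|\[|\]|\\|\||\-|\^|\$|\?|\*|\+|\{|\}|\(|\))', '\\$1')"/>
  </xsl:function>

  <!-- Escape "\" and "$", which are special in a replace() replacement string. -->
  <xsl:function name="f:escape-replacement" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <xsl:sequence select="replace($s, '(\\|\$)', '\\$1')"/>
  </xsl:function>

  <!-- Apply the first pair, then recurse on the rest; stop when no pairs remain. -->
  <xsl:function name="f:replace-many" as="xs:string">
    <xsl:param name="text" as="xs:string"/>
    <xsl:param name="pairs" as="element(mapping)*"/>
    <xsl:sequence select="
        if (empty($pairs)) then $text
        else f:replace-many(
               replace($text,
                       f:escape-pattern($pairs[1]/original),
                       f:escape-replacement($pairs[1]/unicode)),
               subsequence($pairs, 2))"/>
  </xsl:function>

  <!-- One-to-many replacements first, then the one-to-one translate(). -->
  <xsl:template match="text()">
    <xsl:value-of select="translate(f:replace-many(., $map/many/mapping),
                                    $originals, $replacements)"/>
  </xsl:template>

  <!-- Copy everything else unchanged. -->
  <xsl:template match="@* | *">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the replacements run one after another, this is safe only as long
as no replacement string contains a character that a later pair treats as
an original. That should hold when every original is in the PUA and every
replacement is not.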
These two pieces play well together, but the markup replacements (type 3)
complicate the picture. The first strategy that occurred to me was to start
the conversion with these, tokenize the text() node as individual characters,
look each character up in the markup/mapping/original elements, and use
<xsl:copy-of> to effect the replacement. That is, pass the initial input
text() node to:
<xsl:variable name="characters" select="for $i in string-to-codepoints(.)
return codepoints-to-string($i)"/>
This gives me a sequence of individual PUA characters. For each one I then do
the following:
<xsl:for-each select="$characters">
<xsl:choose>
<xsl:when test=". = document('mappings.xml')//markup/mapping/original">
<xsl:copy-of
select="document('mappings.xml')//markup/mapping[original eq
current()]/unicode/node()"/>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="."/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
This is the first time I've ever seen <xsl:copy-of> used to copy something
other than the context node (or its children) in the document being
transformed; in this case it's copying the well-balanced XML from inside the
<unicode> element in mappings.xml, a different document. Is this as unusual
as I think, or have I just led a sheltered life? Or is it unusual because
it's wrong-headed?
In any case, once I seized on <xsl:copy-of> as a possible solution to
introducing markup as part of the replacement, I realized that I could also
have used it for the one-to-many mappings, since <xsl:copy-of
select="unicode/node()"/> returns the same result as <xsl:value-of
select="unicode"/> when <unicode> happens to contain only a single text node,
as it does in the one-to-many mappings. And the same would have worked for
the one-to-one mappings, as well, of course.
This raises another question about another possible complication. A more
general and robust solution would (should) also support many-to-many
mappings, possibly with inserted markup. In that case I can't just tokenize
the string into characters because sometimes a sequence of two or more
characters will be needed as the input value for the mapping pair. Is there a
good way to cater to that eventuality? <xsl:analyze-string> is unappealing
because I'm not sure how I would use it recursively, since once I've done a
replacement that inserts markup, I don't have a string any more, and I can't
just pass the result to another iteration of <xsl:analyze-string> without
having it converted to a string, with the loss of the markup I inserted.
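One possibility that might avoid the recursion altogether is a single
<xsl:analyze-string> pass whose regex is an alternation of every
<original> value, longest first, with each match looked up and copied.
The sketch below is untested and rests on several assumptions: the
mapping file is not empty, each <original> value occurs only once, and
the f: prefix, the key name, and the function name are placeholders.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:remap"
    exclude-result-prefixes="xs f">

  <xsl:variable name="map" select="doc('mappings.xml')/mappings"/>
  <xsl:key name="by-original" match="mapping" use="original"/>

  <!-- Escape regex metacharacters so each original is matched literally. -->
  <xsl:function name="f:escape-pattern" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <xsl:sequence select="replace($s,
        '(\.|\[|\]|\\|\||\-|\^|\$|\?|\*|\+|\{|\}|\(|\))', '\\$1')"/>
  </xsl:function>

  <!-- All originals of every mapping type, longest first, so that a
       multi-character original wins over any shorter one it begins with. -->
  <xsl:variable name="sorted" as="element(original)*">
    <xsl:perform-sort select="$map//mapping/original">
      <xsl:sort select="string-length(.)" order="descending"/>
    </xsl:perform-sort>
  </xsl:variable>
  <xsl:variable name="regex" as="xs:string"
      select="string-join(for $o in $sorted return f:escape-pattern($o), '|')"/>

  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="{$regex}">
      <xsl:matching-substring>
        <!-- Copy the replacement, markup included, from the mapping file. -->
        <xsl:copy-of select="key('by-original', ., $map)/unicode/node()"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Copy everything else unchanged. -->
  <xsl:template match="@* | *">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Since every replacement happens in the same pass over the original text()
node, inserted markup is never fed back in as a string.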
My question, then, after this long-winded exposition, is: How should I have
conceptualized this task? I broke it down into three types of replacements
and adopted a different strategy for each, and I started with the easiest
(the one-to-one replacements). I then realized that the problem was more
general (there are other possible types of mappings), and also that there
were multiple ways to deal with some of the types of mapping. Finally, the
problem begins with a text() node, but once a replacement inserts some
markup, it's no longer just a text() node, so a recursive strategy that
requires a pristine text() node as input may become inapplicable as the
replacements accrue.
On the one hand, this is a one-off transformation for a particular project,
and once it's done I'll never have to run it again, so efficiency of
execution isn't a high priority. On the other hand, these kinds of
gibberish-to-unicode remappings are very common in my world (legacy documents
in unusual writing systems), and I really should think about the general
problem type, instead of cobbling together a new ad hoc solution every time a
new project crosses my desk. I'd be grateful for any advice.
Cheers,
David
djbpitt(_at_)pitt(_dot_)edu
--~------------------------------------------------------------------
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail:
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--