xsl-list

Re: [xsl] gibberish-to-unicode conversation

2011-04-23 22:20:36
Since you appear to be using XSLT 2.0, it seems like character maps
would be the best solution XSLT has to offer.  For your examples,
something like this might work (untested... YMMV):

    <xsl:output method="xml" encoding="UTF-8" use-character-maps="PUAtoBMP"/>
    <xsl:character-map name="PUAtoBMP">
        <xsl:output-character character="a" string="x"/>
        <xsl:output-character character="b" string="yz"/>
        <xsl:output-character character="p" string="q&lt;sup&gt;r&lt;/sup&gt;"/>
    </xsl:character-map>

An XSLT to transform your mapping file into a suitable character map
should be relatively straightforward.
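
A rough, untested sketch of such a generator follows. It reads the mapping
file shown in the quoted message and writes a stylesheet module containing
the character map. The "out" prefix, its urn:example:xslt-alias URI, and the
"as-text" mode are names invented for this illustration. XSLT 2.0 has no
standard serialize function, so markup inside <unicode> is flattened to a
string by two small templates that handle only elements, attributes, and
text. Two things to keep in mind: character maps act only at serialization,
so the inserted markup never exists as nodes in the result tree, and the map
is applied to every text node and attribute value in the output.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:out="urn:example:xslt-alias">

  <!-- Elements written with the (hypothetical) out: prefix are emitted
       in the XSLT namespace, so the result is itself a stylesheet. -->
  <xsl:namespace-alias stylesheet-prefix="out" result-prefix="xsl"/>
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/mappings">
    <out:stylesheet version="2.0">
      <out:character-map name="PUAtoBMP">
        <!-- All three mapping types: direct children, many/*, markup/* -->
        <xsl:for-each select=".//mapping">
          <out:output-character character="{original}">
            <xsl:attribute name="string">
              <xsl:apply-templates select="unicode/node()" mode="as-text"/>
            </xsl:attribute>
          </out:output-character>
        </xsl:for-each>
      </out:character-map>
    </out:stylesheet>
  </xsl:template>

  <!-- Flatten an element to literal tag text, e.g. <sup>r</sup>.
       Assumption: no namespaces, comments, or processing instructions. -->
  <xsl:template match="*" mode="as-text">
    <xsl:value-of select="concat('&lt;', name())"/>
    <xsl:for-each select="@*">
      <xsl:value-of select="concat(' ', name(), '=&quot;', ., '&quot;')"/>
    </xsl:for-each>
    <xsl:text>&gt;</xsl:text>
    <xsl:apply-templates mode="as-text"/>
    <xsl:value-of select="concat('&lt;/', name(), '&gt;')"/>
  </xsl:template>

  <!-- Assumption: replacement text contains no literal & or < characters;
       a character map would emit them unescaped. -->
  <xsl:template match="text()" mode="as-text">
    <xsl:value-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

The generated module could then be pulled into the main stylesheet with
xsl:include and referenced from xsl:output via use-character-maps, as in the
snippet above.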

-Brandon :)


On Sat, Apr 23, 2011 at 10:27 PM, Birnbaum, David J 
<djbpitt(_at_)pitt(_dot_)edu> wrote:
Dear XSLT list,

I would be grateful for some advice about how to conceptualize a project that 
involves remapping the textual characters in an XML document using XSLT. Here 
are the details:

Input: XML with text nodes that are encoded using (or, rather, abusing) the 
Unicode Private Use Area (PUA). The original content creators ignored the 
entire existing Unicode inventory and mapped every text character to 
something in the PUA. (They had their reasons, but they were misguided. 
Damage done.) In most cases their individual PUA characters have individual 
counterparts in the Unicode Base Multilingual Plane (BMP). In some cases, 
though, what they encoded as an individual PUA character needs to be replaced 
by more than one BMP character, and in other cases the replacement also has 
to incorporate markup. See below for details.

Desired output: XML with the PUA text remapped to appropriate Unicode BMP 
values, with any necessary markup inserted.

Mappings: There are at least three types of relationships (mappings) between 
the PUA text in the original and the Unicode BMP needed in the output:

1. One to one. A single PUA character should be replaced by a single Unicode 
BMP character.

2. One to many. A single PUA character should be replaced by two or more 
Unicode BMP characters. No additional markup is inserted.

3. Markup mapping. One PUA character is remapped to one or more Unicode BMP 
characters, but with inserted markup (see example below).

The mapping file that specifies what needs to be replaced by what looks like 
the following:

<mappings>
 <mapping>
   <original>a</original>
   <unicode>x</unicode>
 </mapping>
 <!-- more one-to-one mappings -->
 <many>
   <mapping>
     <original>b</original>
     <unicode>yz</unicode>
   </mapping>
   <!-- more one-to-many mappings -->
 </many>
 <markup>
   <mapping>
     <original>p</original>
     <unicode>q<sup>r</sup></unicode>
   </mapping>
   <!-- more markup mappings -->
 </markup>
</mappings>

Individual <mapping> elements directly under the root <mappings> element are 
one-to-one. The one-to-many <mapping> elements are grouped under <many>, 
which is under <mappings>. The mappings that insert markup are grouped under 
<markup>, which is also under <mappings>.

Possible strategies:

1. One to one. Concatenate the values into strings and use them in 
translate(), e.g.:

<xsl:variable name="originals" 
 select="string-join(doc('mappings.xml')/mappings/mapping/original, '')"/>
<xsl:variable name="replacements" 
 select="string-join(doc('mappings.xml')/mappings/mapping/unicode, '')"/>

and then, later, after doing the more complicated type-2 and type-3 
replacements, pass the output of the last of those replacements to:

translate($text,$originals,$replacements)

2. One to many. Use replace() recursively, iterating over the one-to-many 
mapping pairs, and feeding the output of the final replace() operation into 
the translate() function above as the value of $text.
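
Strategies 1 and 2 as described could be combined roughly as in the
following untested sketch. The my: prefix, its urn:example:remap URI, and
the function name my:replace-all are made up for illustration. The sketch
assumes that neither the originals nor the replacements contain characters
that are special to replace() (regex metacharacters in the pattern,
backslash or dollar sign in the replacement), which should hold when the
originals are PUA code points, and that every one-to-one <original> and
<unicode> is exactly one character long, since translate() pairs characters
by position.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:remap"
    exclude-result-prefixes="xs my">

  <xsl:variable name="map" select="doc('mappings.xml')/mappings"/>

  <!-- translate() needs two strings, so the one-to-one values are joined -->
  <xsl:variable name="originals" as="xs:string"
      select="string-join($map/mapping/original, '')"/>
  <xsl:variable name="replacements" as="xs:string"
      select="string-join($map/mapping/unicode, '')"/>

  <!-- Apply the first pair with replace(), then recurse on the rest -->
  <xsl:function name="my:replace-all" as="xs:string">
    <xsl:param name="text" as="xs:string"/>
    <xsl:param name="pairs" as="element(mapping)*"/>
    <xsl:sequence select="
      if (empty($pairs)) then $text
      else my:replace-all(
             replace($text,
                     string($pairs[1]/original),
                     string($pairs[1]/unicode)),
             subsequence($pairs, 2))"/>
  </xsl:function>

  <!-- Identity template: copy everything that is not a text node -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- One-to-many first, then one-to-one on the result -->
  <xsl:template match="text()">
    <xsl:value-of select="translate(my:replace-all(., $map/many/mapping),
                                    $originals, $replacements)"/>
  </xsl:template>

</xsl:stylesheet>
```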

These two pieces play well together, but the markup replacements (type 3) 
complicate the picture. The first strategy that occurred to me was to start 
the conversion with these, tokenize the text() node as individual characters, 
look each character up in the markup/mapping/original elements, and use 
<xsl:copy-of> to effect the replacement. That is, pass the initial input 
text() node to:

 <xsl:variable name="characters" select="for $i in string-to-codepoints(.) 
return codepoints-to-string($i)"/>

This gives me a sequence of individual PUA characters. For each one I then do 
the following:

<xsl:for-each select="$characters">
 <xsl:choose>
   <xsl:when test=". = document('mappings.xml')//markup/mapping/original">
     <xsl:copy-of
       select="document('mappings.xml')//markup/mapping[original eq 
current()]/unicode/node()"/>
   </xsl:when>
   <xsl:otherwise>
     <xsl:value-of select="."/>
   </xsl:otherwise>
 </xsl:choose>
</xsl:for-each>

This is the first time I've ever seen <xsl:copy-of> used to copy something 
other than the context node (or its children) in the document being 
transformed; in this case it's copying the well-balanced XML from inside the 
<unicode> element in mappings.xml, a different document. Is this as unusual 
as I think, or have I just led a sheltered life? Or is it unusual because 
it's wrong-headed?

In any case, once I seized on <xsl:copy-of> as a possible solution to 
introducing markup as part of the replacement, I realized that I could also 
have used it for the one-to-many mappings, since <xsl:copy-of 
select="unicode/node()"/> returns the same result as <xsl:value-of 
select="unicode"/> when <unicode> happens to contain only a single text node, 
as it does in the one-to-many mappings. And the same would have worked for 
the one-to-one mappings, as well, of course.
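
That observation suggests handling all three mapping types in one
per-character pass. The following is an untested sketch under that
assumption; the key name "by-original" and the variable name "map" are
invented for illustration. It uses the three-argument form of key() so the
lookup runs against the mapping document rather than the input document.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:variable name="map" select="doc('mappings.xml')"/>

  <!-- Index every <mapping>, at any depth, by its <original> value -->
  <xsl:key name="by-original" match="mapping" use="original"/>

  <!-- Identity template: copy everything that is not a text node -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:for-each select="string-to-codepoints(.)">
      <xsl:variable name="char" select="codepoints-to-string(.)"/>
      <!-- Third argument makes key() search the mapping document -->
      <xsl:variable name="hit" select="key('by-original', $char, $map)[1]"/>
      <xsl:choose>
        <xsl:when test="$hit">
          <!-- Copies text and any markup inside <unicode> -->
          <xsl:copy-of select="$hit/unicode/node()"/>
        </xsl:when>
        <xsl:otherwise>
          <!-- Unmapped characters pass through unchanged -->
          <xsl:value-of select="$char"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```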

This raises another question about another possible complication. A more 
general and robust solution would (should) also support many-to-many 
mappings, possibly with inserted markup. In that case I can't just tokenize 
the string into characters because sometimes a sequence of two or more 
characters will be needed as the input value for the mapping pair. Is there a 
good way to cater to that eventuality? <xsl:analyze-string> is unappealing 
because I'm not sure how I would use it recursively, since once I've done a 
replacement that inserts markup, I don't have a string any more, and I can't 
just pass the result to another iteration of <xsl:analyze-string> without 
having it converted to a string, with the loss of the markup I inserted.
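
One way to sidestep the recursion problem might be a single
<xsl:analyze-string> pass whose regex is an alternation of every <original>
value, longest first, so that multi-character sequences are tried before
their single-character prefixes. Each match is replaced exactly once, so no
markup is ever fed back into string processing. The sketch below is untested;
the names "map", "by-original", and "regex" are invented for illustration,
and it assumes the mapping file is non-empty and that no <original> contains
regex metacharacters (which should hold for PUA code points).

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:variable name="map" select="doc('mappings.xml')"/>
  <xsl:key name="by-original" match="mapping" use="original"/>

  <!-- Alternation of all originals, longest first; when two alternatives
       match at the same position, the first one listed is chosen -->
  <xsl:variable name="regex" as="xs:string">
    <xsl:variable name="sorted" as="xs:string*">
      <xsl:perform-sort
          select="distinct-values($map//mapping/original/string())">
        <xsl:sort select="string-length(.)" order="descending"/>
      </xsl:perform-sort>
    </xsl:variable>
    <xsl:sequence select="string-join($sorted, '|')"/>
  </xsl:variable>

  <!-- Identity template: copy everything that is not a text node -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="{$regex}">
      <xsl:matching-substring>
        <!-- The context item here is the matched string -->
        <xsl:copy-of select="key('by-original', ., $map)[1]/unicode/node()"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```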

My question, then, after this long-winded exposition, is: How should I have 
conceptualized this task? I broke it down into three types of replacements 
and adopted a different strategy for each, and I started with the easiest 
(the one-to-one replacements). I then realized that the problem was more 
general (there are other possible types of mappings), and also that there 
were multiple ways to deal with some of the types of mapping. Finally, the 
problem begins with a text() node, but once a replacement inserts some 
markup, it's no longer just a text() node, so a recursive strategy that 
requires a pristine text() node as input may become inapplicable as the 
replacements accrue.

On the one hand, this is a one-off transformation for a particular project, 
and once it's done I'll never have to run it again, so efficiency of 
execution isn't a high priority. On the other hand, these kinds of 
gibberish-to-unicode remappings are very common in my world (legacy documents 
in unusual writing systems), and I really should think about the general 
problem type, instead of cobbling together a new ad hoc solution every time a 
new project crosses my desk. I'd be grateful for any advice.

Cheers,

David
djbpitt(_at_)pitt(_dot_)edu


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: 
<mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--


