xsl-list

[xsl] gibberish-to-unicode conversation

2011-04-23 21:27:37
Dear XSLT list,

I would be grateful for some advice about how to conceptualize a project that 
involves remapping the textual characters in an XML document using XSLT. Here 
are the details:

Input: XML with text nodes that are encoded using (or, rather, abusing) the 
Unicode Private Use Area (PUA). The original content creators ignored the 
entire existing Unicode inventory and mapped every text character to something 
in the PUA. (They had their reasons, but they were misguided. Damage done.) In 
most cases their individual PUA characters have individual counterparts in the 
Unicode Base Multilingual Plane (BMP). In some cases, though, what they encoded 
as an individual PUA character needs to be replaced by more than one BMP 
character, and in other cases the replacement also has to incorporate markup. 
See below for details.

Desired output: XML with the PUA text remapped to appropriate Unicode BMP 
values, with any necessary markup inserted.

Mappings: There are at least three types of relationships (mappings) between 
the PUA text in the original and the Unicode BMP needed in the output:

1. One to one. A single PUA character should be replaced by a single Unicode 
BMP character.

2. One to many. A single PUA character should be replaced by two or more 
Unicode BMP characters. No additional markup is inserted.

3. Markup mapping. One PUA character is remapped to one or more Unicode BMP 
characters, but with inserted markup (see example below).

The mapping file that specifies what needs to be replaced by what looks like 
the following:

<mappings>
  <mapping>
    <original>a</original>
    <unicode>x</unicode>
  </mapping>
  <!-- more one-to-one mappings -->
  <many>
    <mapping>
      <original>b</original>
      <unicode>yz</unicode>
    </mapping>
    <!-- more one-to-many mappings -->
  </many>
  <markup>
    <mapping>
      <original>p</original>
      <unicode>q<sup>r</sup></unicode>
    </mapping>
    <!-- more markup mappings -->
  </markup>
</mappings>

Individual <mapping> elements directly under the root <mappings> element are 
one-to-one. The one-to-many <mapping> elements are grouped under <many>, which 
is under <mappings>. The mappings that insert markup are grouped under 
<markup>, which is also under <mappings>.

Possible strategies:

1. One to one. Concatenate the values into strings and use them in translate(), 
e.g.:

<xsl:variable name="originals"
  select="string-join(doc('mappings.xml')/mappings/mapping/original, '')"/>
<xsl:variable name="replacements"
  select="string-join(doc('mappings.xml')/mappings/mapping/unicode, '')"/>

and then, later, after doing the more complicated type-2 and type-3 
replacements, pass the output of the last of those replacements to:

translate($text,$originals,$replacements)

2. One to many. Use replace() recursively, iterating over the one-to-many 
mapping pairs, and feeding the output of the final replace() operation into the 
translate() function above as the value of $text.
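
A minimal sketch of how strategies 1 and 2 might be chained, assuming XSLT 2.0 and the mappings.xml layout shown above. The my: namespace and the function names my:replace-many, my:escape-regex, and my:escape-replacement are invented for illustration. The two escaping helpers are there only in case an <original> or <unicode> value ever contains a character that replace() treats specially. The sketch also assumes that every <mapping> has exactly one non-empty <original>.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.com/my"
    exclude-result-prefixes="xs my">

  <!-- Mapping file and the two translate() strings for the one-to-one pairs -->
  <xsl:variable name="map" select="doc('mappings.xml')/mappings"/>
  <xsl:variable name="originals"
      select="string-join($map/mapping/original, '')"/>
  <xsl:variable name="replacements"
      select="string-join($map/mapping/unicode, '')"/>

  <!-- Hypothetical helper: backslash-escape regex metacharacters so that
       the value is matched literally by replace() -->
  <xsl:function name="my:escape-regex" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <xsl:sequence select="replace($s,
      '(\.|\[|\]|\\|\||\-|\^|\$|\?|\*|\+|\{|\}|\(|\))', '\\$1')"/>
  </xsl:function>

  <!-- Hypothetical helper: escape \ and $ in a replace() replacement string -->
  <xsl:function name="my:escape-replacement" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <xsl:sequence select="replace($s, '(\\|\$)', '\\$1')"/>
  </xsl:function>

  <!-- Apply the first pair with replace(), then recurse on the remaining pairs -->
  <xsl:function name="my:replace-many" as="xs:string">
    <xsl:param name="text" as="xs:string"/>
    <xsl:param name="pairs" as="element(mapping)*"/>
    <xsl:sequence select="
      if (empty($pairs)) then $text
      else my:replace-many(
             replace($text,
                     my:escape-regex($pairs[1]/original),
                     my:escape-replacement($pairs[1]/unicode)),
             subsequence($pairs, 2))"/>
  </xsl:function>

  <!-- Identity template: copy everything that is not a text node unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text nodes: one-to-many replacements first, then translate() -->
  <xsl:template match="text()" priority="1">
    <xsl:value-of select="translate(my:replace-many(., $map/many/mapping),
                                    $originals, $replacements)"/>
  </xsl:template>

</xsl:stylesheet>
```

Applying the pairs one after another is safe here only because the originals are PUA characters and the replacements are not, so the output of one replace() can never be matched by a later pair. The explicit priority on the text() template avoids a conflict with node() in the identity template, since both patterns have the same default priority.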

These two pieces play well together, but the markup replacements (type 3) 
complicate the picture. The first strategy that occurred to me was to start the 
conversion with these, tokenize the text() node as individual characters, look 
each character up in the markup/mapping/original elements, and use 
<xsl:copy-of> to effect the replacement. That is, pass the initial input text() 
node to:

  <xsl:variable name="characters" select="for $i in string-to-codepoints(.)
    return codepoints-to-string($i)"/>

This gives me a sequence of individual PUA characters. For each one I then do 
the following:

<xsl:for-each select="$characters">
  <xsl:choose>
    <xsl:when test=". = document('mappings.xml')//markup/mapping/original">
      <xsl:copy-of
        select="document('mappings.xml')//markup/mapping[original eq
                current()]/unicode/node()"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="."/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:for-each>

This is the first time I've ever seen <xsl:copy-of> used to copy something 
other than the context node (or its children) in the document being 
transformed; in this case it's copying the well-balanced XML from inside the 
<unicode> element in mappings.xml, a different document. Is this as unusual as 
I think, or have I just led a sheltered life? Or is it unusual because it's 
wrong-headed?

In any case, once I seized on <xsl:copy-of> as a possible solution to 
introducing markup as part of the replacement, I realized that I could also 
have used it for the one-to-many mappings, since <xsl:copy-of 
select="unicode/node()"/> returns the same result as <xsl:value-of 
select="unicode"/> when <unicode> happens to contain only a single text node, 
as it does in the one-to-many mappings. And the same would have worked for the 
one-to-one mappings, as well, of course.

This raises another question about another possible complication. A more 
general and robust solution would (should) also support many-to-many mappings, 
possibly with inserted markup. In that case I can't just tokenize the string 
into characters because sometimes a sequence of two or more characters will be 
needed as the input value for the mapping pair. Is there a good way to cater to 
that eventuality? <xsl:analyze-string> is unappealing because I'm not sure how 
I would use it recursively, since once I've done a replacement that inserts 
markup, I don't have a string any more, and I can't just pass the result to 
another iteration of <xsl:analyze-string> without having it converted to a 
string, with the loss of the markup I inserted.
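
One possible way around the recursion problem is to avoid recursion altogether: build a single regular expression that is the alternation of every <original> value, and let one pass of <xsl:analyze-string> look up each match and copy its replacement. The following is only a sketch under assumptions, not a tested solution: it assumes XSLT 2.0, at least one <mapping> in mappings.xml, and exactly one non-empty <original> per <mapping>. The my: namespace and my:escape-regex are invented for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.com/my"
    exclude-result-prefixes="xs my">

  <!-- Every mapping in the file, whatever its type -->
  <xsl:variable name="pairs" select="doc('mappings.xml')/mappings//mapping"/>

  <!-- Hypothetical helper: backslash-escape regex metacharacters -->
  <xsl:function name="my:escape-regex" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <xsl:sequence select="replace($s,
      '(\.|\[|\]|\\|\||\-|\^|\$|\?|\*|\+|\{|\}|\(|\))', '\\$1')"/>
  </xsl:function>

  <!-- Escaped originals, longest first, so that a multi-character original
       is tried before any shorter original that is a prefix of it -->
  <xsl:variable name="alternatives" as="xs:string*">
    <xsl:for-each select="$pairs/original">
      <xsl:sort select="string-length(.)" order="descending"/>
      <xsl:sequence select="my:escape-regex(.)"/>
    </xsl:for-each>
  </xsl:variable>
  <xsl:variable name="regex" select="string-join($alternatives, '|')"/>

  <!-- Identity template: copy everything that is not a text node unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text nodes: a single pass over the string, with no recursion -->
  <xsl:template match="text()" priority="1">
    <xsl:analyze-string select="." regex="{$regex}">
      <xsl:matching-substring>
        <!-- current() is the matched substring; copy text and markup alike -->
        <xsl:copy-of
            select="$pairs[original eq current()][1]/unicode/node()"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Because every replacement happens in the same pass, the markup that is inserted is never fed back in as a string, so nothing is lost. The longest-first sort relies on the rule that when two alternatives match at the same position, the first one listed in the regular expression is chosen.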

My question, then, after this long-winded exposition, is: How should I have 
conceptualized this task? I broke it down into three types of replacements and 
adopted a different strategy for each, and I started with the easiest (the 
one-to-one replacements). I then realized that the problem was more general 
(there are other possible types of mappings), and also that there were multiple 
ways to deal with some of the types of mapping. Finally, the problem begins 
with a text() node, but once a replacement inserts some markup, it's no longer 
just a text() node, so a recursive strategy that requires a pristine 
text() node as input may become inapplicable as the replacements accrue.

On the one hand, this is a one-off transformation for a particular project, and 
once it's done I'll never have to run it again, so efficiency of execution 
isn't a high priority. On the other hand, these kinds of gibberish-to-Unicode 
remappings are very common in my world (legacy documents in unusual writing 
systems), and I really should think about the general problem type, instead of 
cobbling together a new ad hoc solution every time a new project crosses my 
desk. I'd be grateful for any advice.

Cheers,

David
djbpitt(_at_)pitt(_dot_)edu


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--
