xsl-list

[xsl] gibberish-to-unicode conversion

2011-04-25 00:51:39
Dear XSL List,

Thanks for the quick responses to my inquiry about Unicode conversion. A few 
thoughts:

The main thing that comes to mind is: Did this need to be done in XSLT?

I had thought of doing the job using a general-purpose scripting language, such 
as Python, but preferred an XSLT approach for the following reasons, the first 
actual and the second more theoretical:

1. The PUA values in the input could be serialized as raw characters or as 
numeric character references, the latter in decimal or hex. Matching on the 
lexical (string) value with a general-purpose scripting language seems as if it 
might be more complicated than matching with XSLT and XPath, where the 
different lexical representations would all be recognized as equivalent when 
the input was parsed prior to transformation. 

2. In this project the conversion of the PUA values is unambiguous, which is to 
say that wherever they occur, they should always be converted to the same 
Unicode BMP values. Assuming issue #1 above could be resolved, I wouldn't need 
access to the XML tree to perform the conversion, which means that a 
general-purpose scripting language would do the job. With a more general 
solution in mind, though, I was thinking of similar conversion projects where, 
for example, instead of PUA characters the input XML might use 7-bit ASCII to 
represent both real 7-bit ASCII values (letters of the Latin script) and, say, 
Cyrillic, so that <span writing="latin">a</span> would represent a Latin Small 
Letter A (U+0061) and <span writing="cyrillic">a</span> would also contain a 
lexical U+0061, but in this context it would be intended to represent (and 
would need to be converted to) a Cyrillic Small Letter A (U+0430). An XSLT 
approach lets me use XPath to maintain the state of the writing system, 
converting text nodes inside an element differently depending on the value of 
the @writing attribute on the parent element.
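
To make the two points above concrete, here is a minimal sketch in Python rather than XSLT. It only assumes that the scripting language works on parsed XML instead of the raw source, in which case the same equivalence and the same access to the tree are available. U+E000 and the one-entry mapping table are hypothetical examples, not values from my project.

```python
import xml.etree.ElementTree as ET

# Point 1: three serializations of one PUA character (U+E000 is a
# hypothetical example): raw, decimal reference, hexadecimal reference.
samples = ["<w>\ue000</w>", "<w>&#57344;</w>", "<w>&#xE000;</w>"]
parsed = [ET.fromstring(s).text for s in samples]
# After parsing, every text node holds the same single character.
same_after_parsing = all(t == "\ue000" for t in parsed)

# Point 2: the same lexical "a" is converted differently depending on the
# @writing attribute of its parent <span>.
# Hypothetical one-entry table; a real project would list every letter.
LATIN_TO_CYRILLIC = str.maketrans({"a": "\u0430"})

def convert(root):
    """Rewrite the direct text of <span writing="cyrillic"> in place.

    Only span.text is handled; nested children and tail text are left
    alone in this sketch.
    """
    for span in root.iter("span"):
        if span.get("writing") == "cyrillic" and span.text:
            span.text = span.text.translate(LATIN_TO_CYRILLIC)
    return root

doc = ET.fromstring(
    '<p><span writing="latin">a</span><span writing="cyrillic">a</span></p>'
)
convert(doc)
texts = [span.text for span in doc.iter("span")]
```

Here `texts` ends up holding U+0061 for the Latin span and U+0430 for the Cyrillic one. In XSLT the same test would be expressed as a match pattern on the text node's parent rather than as a loop.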

Since you appear to be using XSLT 2.0, it seems like character maps would be 
the best solution XSLT has to offer ...

Yes, I'm using XSLT 2.0, and I had never thought of character maps (which I've 
never used at all before, so I'm especially grateful for being reminded of 
their existence). A quick look in Michael Kay's book confirms that a character 
map would let me write out the markup easily, but as far as I can tell, the 
value of the @character attribute of an <xsl:output-character> element must be 
a single character, so in a scenario where I may need to convert, say, "ab" to 
"x<sup>y</sup>z", I can't specify "ab" as the value of the @character 
attribute. (This wasn't part of my original spec, but it was one of the 
additional considerations I introduced at the end, when I was mulling over how 
to make the solution more generalizable.) I also wonder about the philosophical 
implications of using a character map (forgive me, but as an academic, I can't 
think about getting the job done without reflecting on whether I'm doing it The 
Right Way). Character maps are intended, it appears, to control serialization 
and to replace disable-output-escaping, tasks that may not properly be the 
business of the XML parser or the XSLT engine. Using them, especially to 
generate tags, creates an opportunity to produce output that is not well-formed 
XML. I can be scrupulous about not doing that, of course, but it feels a bit 
non-XSLTistic. That's not an argument against using a character map when it 
gets the job done, of course, but I think this may be why I never thought of 
trying to write out angle brackets and the like directly, and was drawn instead 
to the <xsl:copy-of> strategy, where what I was copying was well-balanced XML.

In any case, converting "ab" to "x<sup>y</sup>z" seems to be the thorniest 
remaining issue, especially if it has to be used recursively (that is, if the 
same input string has to be passed through several such mappings), since after 
the first match the output is no longer just a string, and therefore can't be 
scanned the same way as the original pure string input. Suggestions welcome, of 
course, and thanks again to those who responded!
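
To picture the multi-pass problem, here is a rough sketch, again in Python rather than XSLT. It assumes the intermediate result is kept as a sequence of strings and elements, and that each mapping scans only the string items. The rules and the helper names (`make_sup`, `apply_rule`) are hypothetical.

```python
import xml.etree.ElementTree as ET

def make_sup(text):
    """Build a <sup> element holding the given text."""
    el = ET.Element("sup")
    el.text = text
    return el

def apply_rule(items, needle, build):
    """Replace each occurrence of needle inside string items only.

    items is a list of str and Element objects; build() returns the
    replacement as a list of the same kind. Elements pass through untouched.
    """
    out = []
    for item in items:
        if not isinstance(item, str):
            out.append(item)
            continue
        parts = item.split(needle)
        for i, part in enumerate(parts):
            if i > 0:
                out.extend(build())  # one replacement per match
            if part:
                out.append(part)
    return out

# Hypothetical rules, applied in order: "ab" -> x<sup>y</sup>z, then "z" -> "Z".
rules = [
    ("ab", lambda: ["x", make_sup("y"), "z"]),
    ("z", lambda: ["Z"]),
]

items = ["cabz"]
for needle, build in rules:
    items = apply_rule(items, needle, build)

# Flatten to a readable form for inspection.
flat = [
    i if isinstance(i, str) else ET.tostring(i, encoding="unicode")
    for i in items
]
```

For the input "cabz", `flat` comes out as the strings "c" and "x", the serialized `<sup>y</sup>`, and two "Z" items. One limitation of this sketch is that a match spanning two adjacent string items is not found, so neighboring strings would need to be merged between passes.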

Cheers,

David
djbpitt(_at_)pitt(_dot_)edu


--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--
