
Re: [xsl] special character encoding, two problems

2014-10-24 14:13:56
On Fri, Oct 24, 2014 at 04:27:18PM -0000, Jonina Dames
<jdames@inera.com> scripsit:
> Hi Graydon,
>
> Thanks for replying. I'm actually trying to get just plain ascii
> equivalents.

Can you show me the plain ASCII equivalent for thorn?

> Right now, the function I am using is this:
>
>     <xsl:value-of
>       select="normalize-unicode(replace(normalize-unicode(.,'NFKD'),'\p{Mn}',''),'NFKC')"/>
>
> What I'm unclear on is why the function is correctly converting
> "&#x00E9;" to "e", but not "&#xf8;" to "o".

In Unicode, you have different normal forms.

The usual normal form, and the one XML documents "expect", is the
composed normal form (NFC): where there's a single code point which
represents the combination of a letter and an accent, use that single
code point.

So NFC uses the single code point é (U+00E9) rather than a plain e
followed by a combining acute accent (U+0301).

What the *decomposed* normal form -- the NFKD in the inner
normalize-unicode call -- does is say: no, no, if we can represent this
as a letter plus some combining marks, do it that way.  So we get e
followed by a combining acute accent, and that accent can be stripped
by the replace() as a member of \p{Mn}, the Unicode category "Mark,
nonspacing".

When we get to ø, it's not a modified &#x006f; LATIN SMALL LETTER O;
it's some other letter that just happens to *look* something like a
Latin small letter o.  It's not categorized as an o with a modifier
DESPITE being called "LATIN SMALL LETTER O WITH STROKE".  I have no
idea why; the ways of the Unicode Consortium are mysterious.  So
decomposing it doesn't produce an o and an accent, it just produces
U+00F8 again, and when the non-spacing marks are removed, nothing
changes.
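
And you can see ø pass through untouched (same sketch, same caveat):

    <xsl:value-of
      select="normalize-unicode(replace(normalize-unicode('&#xf8;','NFKD'),'\p{Mn}',''),'NFKC')"/>
    <!-- U+00F8 has no decomposition at all, so there's no mark for
         replace() to strip; the result is still "ø" -->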

> Is there a way to make this function convert all accented latin
> letters to plain ascii characters?

Well, technically, that's precisely and specifically what it does.  The
problem is that your clients appear to be having a disagreement with the
Unicode Consortium about which letters are really accented Latin and
which are letters in their own right.  (Which is why I keep bringing up
thorn; thorn is without question a letter with no direct ASCII analog.)

> We really need coverage for any letter that can appear in a European
> name, so this should also convert the numeric character reference for
> thorn (þ, &#xfe;) to one or more plain ascii characters, to cover
> authors from Iceland.

But what?  Thorn isn't really "th", just like eth isn't really "dh".

(From time to time we've all had clients who were perhaps a little mad.
This makes everyone much less willing to guess just how your particular
client might be mad, rather than more, because madness is such a wide
country.)

> I ran a broad test of all the accented latin letters most likely to
> occur in author names, and these 28 characters are the only ones that
> were not converted to plain ascii equivalents:
> [snip list]
> Is there a different set of flags for this function that will yield
> the result I'm looking for?

What result *are* you looking for?  Many of those letters have no ASCII
equivalent and are not generally considered the same for sorting.
(Torvalds and Þorvalds really shouldn't be sorted as the same author,
for example.)

But, specifically, no; normalize-unicode() offers five choices of
normalization form: NFC, NFD, NFKC, NFKD, and "fully normalized"
(FULLY-NORMALIZED).

"C" is "composed" and "D" is "decomposed".

The "K" stands for, I presume, "compatibility"; KC and KD are the
stronger forms that normalize away compatible characters.  (Unicode
includes multiple representations of some characters because Unicode
combines a bunch of pre-existing character representations.  The K
variants pick the most canonical representation of the character.)  So
the second argument for the decomposition is already as strong as you
can get it.

("Fully normalized" has to do with string concatenation and is composed,
anyway, so it won't help you here.)
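
The difference between the plain and K forms shows up with
compatibility characters like the fi ligature (literals again, just to
illustrate):

    <xsl:value-of select="normalize-unicode('&#xfb01;','NFD')"/>
    <!-- still the single ligature character U+FB01 -->
    <xsl:value-of select="normalize-unicode('&#xfb01;','NFKD')"/>
    <!-- "fi" as two plain letters -->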

> If this function cannot do that, what is the best way to convert all
> of these outlying characters? I need this conversion to happen in
> only one element of my XML, not the entire XML document. I can't use
> translate because it's a one-to-one conversion that doesn't cover the
> ligatures listed above. If normalize-unicode cannot be made to cover
> all the characters listed above, can character-maps be applied that
> act specifically on only one element?

Character maps apply to the whole result document, so they won't do what
you want here.

It looks to me like your best bet is to write a function that applies
the decompose-strip-recompose trick and then calls replace() repeatedly
to find each of your remaining problematic letters, substituting
whatever, specifically, needs to be used in place of each of them.  You
then invoke that function to provide the contents of the one element
that needs its contents altered.
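
A minimal sketch of the shape of that function, assuming XSLT 2.0 (the
my: prefix is whatever namespace you use for local functions, and the
two sample mappings for ø and þ stand in for whatever your client
actually wants):

    <xsl:function name="my:asciify" as="xs:string">
      <xsl:param name="in" as="xs:string"/>
      <!-- decompose, strip combining marks, recompose -->
      <xsl:variable name="stripped" as="xs:string"
        select="normalize-unicode(
                  replace(normalize-unicode($in,'NFKD'),'\p{Mn}',''),
                  'NFKC')"/>
      <!-- then hand-map each letter Unicode treats as a letter in its
           own right -->
      <xsl:sequence
        select="replace(replace($stripped,'&#xf8;','o'),'&#xfe;','th')"/>
    </xsl:function>

Then something like select="my:asciify(.)" in the template for the one
element you care about.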

It will be an ugly function but it at least gets to be very specific to
your client's particular needs.

-- Graydon
