Re: [xsl] special character encoding, two problems

Hi Graydon,

Thanks for replying. I'm actually trying to get just plain ascii equivalents for all the accented letters because our customer needs plain ascii versions of author names for indexing purposes. Right now, the normalize-unicode function is working correctly for most accented letters, like an acute e (&#xe9;, é) transforms into a plain ascii "e". Characters like o with a slash (&#xf8;, ø) are NOT being converted to a plain ascii "o".


Right now, the function I am using is this:

<xsl:value-of select="normalize-unicode(replace(normalize-unicode(.,'NFKD'),'\p{Mn}',''),'NFKC')"/>

What I'm unclear on is why the function is correctly converting "é" to "e", but not "ø" to "o". Is there a way to make this function convert all accented latin letters to plain ascii characters? We really need coverage for any letter that can appear in a European name, so this should also convert the numeric character reference for thorn (&#xfe;, þ) to one or more plain ascii characters, to cover authors from Iceland.

I ran a broad test of all the accented latin letters most likely to occur in author names, and these 28 characters are the only ones that were not converted to plain ascii equivalents:


&#xc6;    Æ
&#xd0;    Ð
&#xd8;    Ø
&#xde;    Þ
&#xdf;    ß
&#xe6;    æ
&#xf0;    ð
&#xf8;    ø
&#xfe;    þ
&#x110;    Đ
&#x111;    đ
&#x126;    Ħ
&#x127;    ħ
&#x131;    ı
&#x141;    Ł
&#x142;    ł
&#x14a;    Ŋ
&#x14b;    ŋ
&#x152;    Œ
&#x153;    œ
&#x166;    Ŧ
&#x167;    ŧ
&#x180;    ƀ
&#x197;    Ɨ
&#x1b5;    Ƶ
&#x1b6;    ƶ
&#x1e4;    Ǥ
&#x1e5;    ǥ

Is there a different set of flags for this function that will yield the result I'm looking for? If this function cannot do that, what is the best way to convert all of these outlying characters? I need this conversion to happen in only one element of my XML, not the entire XML document. I can't use translate because it's a one-to-one conversion that doesn't cover the ligatures listed above. If normalize-unicode cannot be made to cover all the characters listed above, can character-maps be applied that act specifically on only one element?
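
One possible shape for a per-element conversion, offered only as a sketch: keep the existing normalize-unicode/replace step, then look up whatever non-ASCII characters remain in a small table that allows one-to-many replacements. The function name f:to-ascii, the namespace URI, the element name author-name, and every ASCII spelling in the table (for example "Th" for thorn) are assumptions made for illustration, not anything settled in this thread.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:ascii-fold"
    exclude-result-prefixes="xs f">

  <!-- Lookup table for letters that have no Unicode decomposition.
       The "to" spellings are assumptions; adjust to the customer's rules. -->
  <xsl:variable name="fold-map">
    <c from="&#xc6;" to="AE"/>  <c from="&#xe6;" to="ae"/>
    <c from="&#xd0;" to="D"/>   <c from="&#xf0;" to="d"/>
    <c from="&#xd8;" to="O"/>   <c from="&#xf8;" to="o"/>
    <c from="&#xde;" to="Th"/>  <c from="&#xfe;" to="th"/>
    <c from="&#xdf;" to="ss"/>  <c from="&#x131;" to="i"/>
    <c from="&#x110;" to="D"/>  <c from="&#x111;" to="d"/>
    <c from="&#x126;" to="H"/>  <c from="&#x127;" to="h"/>
    <c from="&#x141;" to="L"/>  <c from="&#x142;" to="l"/>
    <c from="&#x14a;" to="N"/>  <c from="&#x14b;" to="n"/>
    <c from="&#x152;" to="OE"/> <c from="&#x153;" to="oe"/>
    <c from="&#x166;" to="T"/>  <c from="&#x167;" to="t"/>
    <c from="&#x180;" to="b"/>  <c from="&#x197;" to="I"/>
    <c from="&#x1b5;" to="Z"/>  <c from="&#x1b6;" to="z"/>
    <c from="&#x1e4;" to="G"/>  <c from="&#x1e5;" to="g"/>
  </xsl:variable>

  <xsl:function name="f:to-ascii" as="xs:string">
    <xsl:param name="s" as="xs:string?"/>
    <!-- Step 1: the expression already in use (decompose, drop marks, recompose). -->
    <xsl:variable name="stripped"
        select="normalize-unicode(replace(normalize-unicode($s,'NFKD'),'\p{Mn}',''),'NFKC')"/>
    <!-- Step 2: replace each remaining non-ASCII character via the table.
         The regex attribute is an attribute value template, hence the doubled braces. -->
    <xsl:variable name="parts" as="xs:string*">
      <xsl:analyze-string select="$stripped" regex="\P{{IsBasicLatin}}">
        <xsl:matching-substring>
          <xsl:variable name="ch" select="."/>
          <xsl:variable name="hit" select="$fold-map/c[@from = $ch]"/>
          <!-- Characters missing from the table pass through unchanged. -->
          <xsl:sequence select="if ($hit) then string($hit/@to) else $ch"/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:sequence select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <xsl:sequence select="string-join($parts, '')"/>
  </xsl:function>

  <!-- Identity copy for everything else. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

  <!-- Only text inside the (hypothetical) author-name element is folded. -->
  <xsl:template match="author-name/text()">
    <xsl:value-of select="f:to-ascii(.)"/>
  </xsl:template>

</xsl:stylesheet>
```

With this table, a name such as "Jørgen Þórðarson" inside author-name should come out as "Jorgen Thordarson", while text in every other element is copied untouched. Because the lookup is called from a template rather than at serialization time, it can be scoped to a single element.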


Thanks,
Joni



On 10/24/14 9:11 AM, Eliot Kimber ekimber(_at_)contrext(_dot_)com wrote:

I can't restrain my own pedantry: the correct term is "numeric character
reference", not "numeric entity": http://www.w3.org/TR/REC-xml/#dt-charref

Given that I think I'm the only person who ever uses the term correctly
and consistently, we probably should have just used "numeric entity" but
so it goes.

Cheers,

E.
—————
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com




On 10/23/14, 4:13 PM, "Graydon graydon(_at_)marost(_dot_)ca"
<xsl-list-service(_at_)lists(_dot_)mulberrytech(_dot_)com> wrote:

On Thu, Oct 23, 2014 at 08:39:11PM -0000, Jonina Dames jdames(_at_)inera(_dot_)com scripsit:

Thanks for the advice! The
<xsl:value-of select="normalize-unicode(replace(normalize-unicode(.,'NFKD'),'\p{Mn}',''),'NFKC')"/>
function works for most of the entities, but it's still missing a
couple dozen characters.

Terminology pedant time --

&#x00e9; is a numeric entity and exactly the same thing as é just
written differently.

&eacute; is a named entity reference (which had better be defined
somewhere)

Either, as soon as the XML document is parsed, turns into U+00E9 in some
internal representation and they're not different from each other or the
representation for é if someone had typed that directly in the utf-8
input file.

So when you say "entity" here I'm getting the nervous feeling that I
don't know what you mean; can you provide some examples?

Some of the author names still have unicode entities instead of plain
ascii (for example, several characters with a stroke, several
ligatures, thorn characters, upper and lowercase). Is there a

Well, examples would be good, but thorn, for example, &#x00FE; which is
the self-same code point as þ, doesn't involve a modifier; it's one
whole letter that doesn't exist inside ASCII.

Stripping the modifiers -- which will give you e from é if you decompose
é first, because then it's e + ˊ, which you could write &#x0065; +
&#x0301; and it would be the same -- doesn't do anything because there
is no modifier there, it's just the single code-point for thorn.
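
A small way to see the difference described above, as a sketch rather than anything taken from the original stylesheet: print the code points each character decomposes to under NFKD. The three sample characters are picked for illustration, and the stylesheet can be run against any input document.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- e-acute, o-slash, thorn -->
    <xsl:for-each select="('&#xe9;', '&#xf8;', '&#xfe;')">
      <!-- e-acute yields 101 769 (e + combining acute, U+0301);
           o-slash yields 248 and thorn yields 254, i.e. unchanged,
           because neither has a decomposition -->
      <xsl:value-of
          select="., ':', string-to-codepoints(normalize-unicode(., 'NFKD')), '&#10;'"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Only the first character produces a \p{Mn} code point (769) for the replace() call to strip; the other two come back as the same single code point they went in as.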

variation of this function or a parameter that will catch and convert
ALL of these to plain ascii, as well as the standard acute and cedil
characters? Or do I need to address these outlying characters with
something else (not translate, since I can't use a one-to-one
replacement for ligature entities)?

ASCII, strictly, is seven-bit; there are lots of things you can't
represent in ASCII.  &#x00e9; *is not* ASCII just because those eight
characters all happen to be ASCII characters.

So it sounds like you're trying to (either) map U+00FE, þ, to &thorn; or
something like that (which is not, I cannot stress too much, ASCII; it
might be an ASCII representation of a non-ASCII code-point, but it's
still a non-ASCII code-point) or have þ decompose into t+h or something
of that ilk.  (Which is at least actually ASCII.)

Either way you'd have to use character mappings for those; there aren't
any modifiers to remove.
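
If "character mappings" is taken to mean XSLT 2.0's xsl:character-map (an assumption about what is meant here), a minimal sketch looks like the following. The map name ascii-fold and the replacement strings are hypothetical, and only four of the affected characters are shown.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The map is attached to the output definition, not to a template. -->
  <xsl:output method="xml" use-character-maps="ascii-fold"/>

  <xsl:character-map name="ascii-fold">
    <!-- One character in, any string out, so ligatures are not a problem. -->
    <xsl:output-character character="&#xde;" string="Th"/>
    <xsl:output-character character="&#xfe;" string="th"/>
    <xsl:output-character character="&#xe6;" string="ae"/>
    <xsl:output-character character="&#xf8;" string="o"/>
  </xsl:character-map>

  <!-- Identity copy; the substitution happens later, during serialization. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

One design point worth noting: a character map is applied when the result is serialized, so it affects every text and attribute node in that output rather than a chosen element, and it has no effect if the result tree is passed on without being serialized.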

Are you really compelled to deliver seven bit ASCII?

And, please, some examples.

-- Graydon



--
Jonina Dames
Customer Support Specialist
Inera Inc.
+1 617 932 1932
eXtyles on Twitter <https://twitter.com/extyles>
jdames(_at_)inera(_dot_)com
