On Thu, Oct 23, 2014 at 08:39:11PM -0000, Jonina Dames
jdames(_at_)inera(_dot_)com
scripsit:
> Thanks for the advice! The <xsl:value-of
> select="normalize-unicode(replace(normalize-unicode(.,'NFKD'),'\p{Mn}',''),'NFKC')"
> /> function works for most of the entities, but it's still missing a
> couple dozen characters.
Terminology pedant time --
&#xE9; is a numeric entity and exactly the same thing as &#233;, just
written differently.
&eacute; is a named entity reference (which had better be defined
somewhere).
Either one, as soon as the XML document is parsed, turns into U+00E9 in some
internal representation; they're not different from each other, or from the
representation of é if someone had typed that directly into the UTF-8
input file.
So when you say "entity" here I'm getting the nervous feeling that I
don't know what you mean; can you provide some examples?
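
As a minimal sketch of the point about parsing (the stylesheet and the
template name "main" are illustrative, not taken from the original
question), this XSLT 2.0 stylesheet declares eacute in an internal DTD
subset and prints the code point behind each spelling:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsl:stylesheet [
  <!-- A named entity has to be declared somewhere before it can be used. -->
  <!ENTITY eacute "&#xE9;">
]>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Illustrative entry point: run with "main" as the initial template. -->
  <xsl:template name="main">
    <!-- The XML parser resolves every reference before XPath sees the string. -->
    <xsl:value-of select="string-to-codepoints('&#xE9;')"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="string-to-codepoints('&#233;')"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="string-to-codepoints('&eacute;')"/>
    <xsl:text> </xsl:text>
    <xsl:value-of select="string-to-codepoints('é')"/>
  </xsl:template>
</xsl:stylesheet>
```

Assuming a processor that can start at a named template, this should
print 233 four times: by the time the stylesheet runs, all four literals
are the same one-character string.
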
> Some of the author names still have unicode entities instead of plain
> ascii (for example, several characters with a stroke, several
> ligatures, thorn characters, upper and lowercase). Is there a
Well, examples would be good, but thorn, for example, &thorn;, which is
the self-same code point as &#xFE;, doesn't involve a modifier; it's one
whole letter that doesn't exist inside ASCII.
Stripping the modifiers -- which will give you e from é if you decompose
é first, because then it's e + U+0301 (the combining acute accent), which
you could write e + &#x301; and it would be the same -- doesn't do
anything for thorn, because there is no modifier there; it's just the
single code point for thorn.
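
To make that concrete, here is a hedged sketch in XSLT 2.0. The function
name my:strip-marks and the namespace urn:example:my are hypothetical;
the expression inside it is the one from the question. Note that
processors are only required to support NFC, so this assumes one that
also implements NFKD and NFKC.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:my"
    exclude-result-prefixes="xs my">
  <xsl:output method="text"/>

  <!-- Hypothetical helper: decompose, drop nonspacing marks, recompose. -->
  <xsl:function name="my:strip-marks" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <xsl:sequence select="normalize-unicode(
        replace(normalize-unicode($s, 'NFKD'), '\p{Mn}', ''), 'NFKC')"/>
  </xsl:function>

  <xsl:template name="main">
    <!-- U+00E9 decomposes to two code points: 101 (e) and 769 (U+0301). -->
    <xsl:value-of select="string-to-codepoints(normalize-unicode('&#xE9;', 'NFKD'))"/>
    <xsl:text>&#10;</xsl:text>
    <!-- With the mark removed, only the base letter e is left. -->
    <xsl:value-of select="my:strip-marks('&#xE9;')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- U+00FE has no decomposition, so it stays the single code point 254. -->
    <xsl:value-of select="string-to-codepoints(normalize-unicode('&#xFE;', 'NFKD'))"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Nothing matches \p{Mn}, so thorn comes back unchanged. -->
    <xsl:value-of select="my:strip-marks('&#xFE;')"/>
  </xsl:template>
</xsl:stylesheet>
```

The same reasoning applies to any letter that Unicode treats as a base
character in its own right rather than as a letter plus a combining mark.
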
> variation of this function or a parameter that will catch and convert
> ALL of these to plain ascii, as well as the standard acute and cedil
> characters? Or do I need to address these outlying characters with
> something else (not translate, since I can't use a one-to-one
> replacement for ligature entities)?
ASCII, strictly, is seven-bit; there are lots of things you can't
represent in ASCII. &eacute; *is not* ASCII just because those eight
characters all happen to be ASCII characters.
So it sounds like you're trying either to map U+00FE, þ, to &#xFE; or
something like that (which is not, I cannot stress too much, ASCII; it
might be an ASCII representation of a non-ASCII code point, but it's
still a non-ASCII code point) or to have þ decompose into t+h or something
of that ilk. (Which is at least actually ASCII.)
Either way you'd have to use character mappings for those; there aren't
any modifiers to remove.
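
One way this could look in XSLT 2.0 is a character map applied at
serialization time. This is only a sketch: the map name ascii-fallback
and every replacement string below (th, o, ae, and so on) are
assumptions for illustration, not an agreed transliteration, and the
list would need extending to cover whatever characters actually occur
in the data.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The map is applied by the serializer, after the transformation. -->
  <xsl:output method="xml" encoding="US-ASCII"
      use-character-maps="ascii-fallback"/>

  <!-- Hypothetical mapping: one character in, an arbitrary string out. -->
  <xsl:character-map name="ascii-fallback">
    <xsl:output-character character="&#xFE;" string="th"/>  <!-- thorn -->
    <xsl:output-character character="&#xDE;" string="Th"/>  <!-- capital thorn -->
    <xsl:output-character character="&#xF8;" string="o"/>   <!-- o with stroke -->
    <xsl:output-character character="&#xD8;" string="O"/>   <!-- capital O with stroke -->
    <xsl:output-character character="&#xE6;" string="ae"/>  <!-- ae ligature -->
    <xsl:output-character character="&#xC6;" string="AE"/>  <!-- capital AE ligature -->
    <xsl:output-character character="&#x153;" string="oe"/> <!-- oe ligature -->
    <xsl:output-character character="&#x152;" string="OE"/> <!-- capital OE ligature -->
  </xsl:character-map>

  <!-- Identity transform: copy the input through unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Because a character map only takes effect when the result is
serialized, it does not change strings that are compared or sorted
inside the stylesheet, and unlike translate() it is not limited to
one-to-one replacements.
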
Are you really compelled to deliver seven bit ASCII?
And, please, some examples.
-- Graydon