Re: [xsl] Getting the Base Character of Character with Diacritic
2006-09-19 00:15:05
Hi Jeff,
This, I think, is far from trivial, but here's a possible approach you
might want to consider.
First of all, a character with a diacritic is not necessarily encoded as
a single character. For instance, the character Ẫ can be encoded as
x1EAA, as x0041 x0302 x0303 (A + ^ + ~) or as x00C2 x0303 (Â + ~), where
the ^ and ~ are not the characters on your keyboard, but the combining
diacritical marks. (x00C3 x0302, Ã + ^, looks much the same but is not
equivalent, because the order of the marks matters.) Also, depending on
how you look at it, Æ (x00C6) equals (is compatible with) AE, although
Unicode itself defines no decomposition for Æ, so normalization will not
split it for you.
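To make the equivalence concrete, here is a small sketch outside XSLT, in
Python, using its standard unicodedata module. It assumes the combining
circumflex and tilde are U+0302 and U+0303; the variable names are only
for illustration.

```python
import unicodedata

precomposed = "\u1EAA"         # Ẫ as a single code point
fully_split = "A\u0302\u0303"  # A + combining circumflex + combining tilde
half_split = "\u00C2\u0303"    # Â + combining tilde


def nfc(s: str) -> str:
    """Compose as far as possible (Normalization Form C)."""
    return unicodedata.normalize("NFC", s)


# All three spellings collapse to the same string once normalized.
all_equal = nfc(precomposed) == nfc(fully_split) == nfc(half_split)
print(all_equal)  # True

# Decomposing the single code point yields base letter plus two marks.
decomposed = unicodedata.normalize("NFD", precomposed)
print([f"{ord(c):04X}" for c in decomposed])  # ['0041', '0302', '0303']
```

Comparing the raw strings without normalizing first would report all
three as different, which is why normalization matters for comparison.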
To normalize this Unicode stuff, the Unicode Consortium has defined
four normalization forms. The algorithms behind them
either try to decompose (D) as much as possible, or compose (C) as much
as possible. Furthermore, they can try to (de)compose even further to
the compatible (K) counterparts of the characters. This gives four
variants of Normalization Forms: NFC, NFKC, NFD, NFKD (see
http://www.unicode.org/unicode/reports/tr15). The XPath 2.0 function
normalize-unicode is used for dealing with this and also adds a fifth
variant: fully-normalized, which is NFC plus not starting with a
combining character.
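As a rough illustration of how the four forms differ, the sketch below
runs one sample string through each of them. It is Python rather than
XSLT, using the standard unicodedata module; the sample (the ﬁ ligature
followed by é) is my own choice, picked because it has both a
compatibility and a canonical decomposition.

```python
import unicodedata

sample = "\uFB01\u00E9"  # ﬁ ligature (U+FB01) followed by é (U+00E9)

forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

for name, value in forms.items():
    code_points = " ".join(f"{ord(c):04X}" for c in value)
    print(f"{name:5} {code_points}")
# NFC   FB01 00E9
# NFD   FB01 0065 0301
# NFKC  0066 0069 00E9
# NFKD  0066 0069 0065 0301
```

Only the K forms split the ligature into f and i, and only the D forms
leave the accent as a separate combining mark.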
That's the theory. Now practice. Processors only need to support one
normalization form, namely NFC. This is needed so you can correctly
compare two strings. What you need is NFKD (or, to a lesser extent,
NFD). Then you can translate the combining diacritical marks to nothing
(x0302, x0300, x0301, x0303, x0309, x0323, x0304 for a start: circumflex,
grave, acute, tilde, hook above, dot below, macron) and you have a
string without diacritical marks.
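The decompose-then-strip idea can be sketched as follows. This is Python
with the standard unicodedata module, not XSLT, and strip_diacritics is
a hypothetical helper name. Instead of listing the marks one by one, it
assumes that dropping every character in the Unicode category Mn
(nonspacing mark) is what you want.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose with NFKD, then drop all nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


print(strip_diacritics("\u0101"))        # ā -> a
print(strip_diacritics("\u1EAA"))        # Ẫ -> A
print(strip_diacritics("Crème brûlée"))  # Creme brulee
print(strip_diacritics("\u00C6"))        # Æ has no decomposition, stays Æ
```

In XPath 2.0 the same idea would presumably read
replace(normalize-unicode($s, 'NFKD'), '\p{Mn}', ''), assuming your
processor supports NFKD.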
As far as I can tell, your best bet is Saxon-SA, which seems to support
NFC, NFD, NFKC, and NFKD
(http://www.saxonica.com/conformance/xqts100/SaxonResults.html). I am
not sure whether it will work the way you expect it to, though.
Not sure how you intend to deal with Chinese, Hangul, Hebrew, Cyrillic
and some other more complex scripts, because there the combining
characters are an important part of the letter, without which they mean
nothing. This depends on what you are actually after.
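A few samples show why this needs care. The sketch below is Python with
the standard unicodedata module; strip_marks is a hypothetical helper
that decomposes with NFKD and drops nonspacing marks (category Mn), and
the sample words are my own.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose with NFKD, then drop all nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# Cyrillic: й (U+0439) decomposes to и + breve, so stripping yields и,
# which is a different letter rather than the same letter without accent.
cyrillic = strip_marks("\u0439")

# Hangul: the syllable 한 (U+D55C) decomposes into three jamo. They are
# letters, not marks, so nothing is removed, but the string is now three
# code points long and no longer equal to the original.
hangul = strip_marks("\uD55C")

# Hebrew: the vowel points are nonspacing marks, so they are all removed.
hebrew = strip_marks("\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD")

print(cyrillic == "\u0438")   # True
print(len(hangul))            # 3
print(len(hebrew))            # 4 consonants remain
```

Whether any of these results is acceptable depends, again, on what the
stripped text is used for.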
Cheers,
Abel Braaksma
http://abelleba.metacarpus.com
Jeff Sese wrote:
Hi,
Is there a way in xslt for me to get the base character of a character
with diacritic?
Like ā to a? I was thinking of using the translate function, but
there are too many characters to include.
-- Jeff