Re: [xsl] Getting the Base Character of Character with Diacritic

Hi Jeff,

This, I think, is far from trivial, but here's a possible approach you might want to consider.

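First of all, a character with a diacritic does not necessarily need to be encoded as one character. For instance, the character Ẫ can be encoded as x1EAA, as x0041x0302x0303 (A + ^ + ~), or as x00C2x0303 (Â + ~), where the ~ and ^ are not the characters on your keyboard, but the combining diacritical marks. The sequence x00C3x0302 (Ã + ^) looks similar, but it stacks the marks in the other order and is not canonically equivalent to the other three. Also, depending on how you look at it, Æ (x00C6) equals AE, although Unicode itself defines no decomposition for Æ; a real compatibility example is the ligature ﬁ (xFB01), which is compatible with fi.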
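To make the equivalence concrete, here is a small sketch in Python rather than XSLT (my choice for illustration only; it uses the standard unicodedata module). It assumes the combining circumflex is U+0302 and the combining tilde is U+0303, and checks which encodings of Ẫ end up identical after normalization.

```python
import unicodedata

precomposed = "\u1EAA"              # Ẫ as a single code point
fully_decomposed = "A\u0302\u0303"  # A + combining circumflex + combining tilde
partly_composed = "\u00C2\u0303"    # Â + combining tilde
other_order = "\u00C3\u0302"        # Ã + combining circumflex (marks swapped)


def nfc(text: str) -> str:
    """Return the canonically composed form of text."""
    return unicodedata.normalize("NFC", text)


# The first three spellings are canonically equivalent.
same = nfc(precomposed) == nfc(fully_decomposed) == nfc(partly_composed)

# Swapping the order of the two marks gives a different character sequence.
different = nfc(other_order) != nfc(precomposed)

print(same, different)
```

The point is only that a plain string comparison on unnormalized input can fail for text that looks identical on screen.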

To normalize this Unicode stuff, the Unicode Consortium has defined four normalization forms. These infamous algorithms either decompose (D) as much as possible, or compose (C) as much as possible. Furthermore, they can additionally replace characters by their compatibility (K) counterparts. This gives four variants of Normalization Forms: NFC, NFKC, NFD, NFKD (see http://www.unicode.org/unicode/reports/tr15). The XPath 2.0 function normalize-unicode is used for dealing with this and also adds a fifth variant: fully-normalized, which is NFC plus not starting with a combining character.
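As a rough illustration of how the four forms differ, the following Python sketch (again Python's unicodedata, not an XSLT processor) normalizes a two-character sample of my own choosing: é followed by the ligature ﬁ. The first has a canonical decomposition, the second only a compatibility one.

```python
import unicodedata

sample = "\u00E9\uFB01"  # é followed by the ligature ﬁ

# Apply each of the four Unicode normalization forms to the same input.
forms = {
    name: unicodedata.normalize(name, sample)
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

for name, value in forms.items():
    # Show the result as a list of code points, since the glyphs look alike.
    print(name, [f"U+{ord(ch):04X}" for ch in value])
```

NFC leaves the sample alone, NFD splits only the é, and the two K forms additionally replace the ligature by f and i.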

That's for the theory. Now practice. Processors only need to support one normalization form, namely NFC. This is needed so you can correctly compare two strings. What you need is NFKD (or, to a lesser extent, NFD). Then you can translate the combining diacritical marks to nothing (x0302, x0300, x0301, x0303, x0309, x0323, x0304 for a start: circumflex, grave, acute, tilde, hook above, dot below, macron) and you have a string without diacritical marks.
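The decompose-then-remove idea can be sketched as follows. This is Python, not XSLT, and strip_diacritics is a name I made up for the example. Instead of listing individual marks it drops every character in the Unicode category Mn (nonspacing mark), which is an assumption about what you want removed.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose text (NFD) and drop all nonspacing combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn"  # Mn = nonspacing mark
    )


print(strip_diacritics("\u0101"))  # ā
print(strip_diacritics("\u1EAA"))  # Ẫ
print(strip_diacritics("\u00C6"))  # Æ has no decomposition, so it is unchanged
```

In XPath 2.0 the same idea would presumably read replace(normalize-unicode($s, 'NFD'), '\p{Mn}', ''), provided your processor supports NFD; I have not checked that against any particular processor.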

As far as I can tell, your best bet is Saxon-SA; it seems to support NFC, NFD, NFKC, and NFKD (http://www.saxonica.com/conformance/xqts100/SaxonResults.html). Not sure if it will work the way you expect it to.

Not sure how you intend to deal with Chinese, Hangul, Hebrew, Cyrillic and some other more complex scripts, because there the combining characters are an important part of the letter, without which they mean nothing. This depends on what you are actually after.
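Two quick examples of what can go wrong outside Latin text, again as a Python sketch with a hypothetical helper strip_marks that decomposes with NFD and drops nonspacing marks. The sample characters (Cyrillic й and the Hangul syllable 한) are my own picks.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose text (NFD) and drop all nonspacing combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Cyrillic й decomposes into и plus a combining breve, so stripping the
# mark silently turns it into a different letter.
cyrillic = strip_marks("\u0439")

# A Hangul syllable decomposes into three conjoining jamo. They are letters,
# not marks, so nothing is removed, but the string is now three code points.
hangul = strip_marks("\uD55C")

print(cyrillic, len(hangul))
```

Recomposing with NFC afterwards restores the Hangul syllable, but the Cyrillic letter stays changed.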


Cheers,

Abel Braaksma
http://abelleba.metacarpus.com


Jeff Sese wrote:

Hi,
Is there a way in XSLT for me to get the base character of a character with a diacritic? Like ā to a? I was thinking of using the translate function, but there are too many characters to include.
-- Jeff


