xsl-list

Re: [xsl] Getting the Base Character of Character with Diacritic

2006-09-19 00:15:05
Hi Jeff,

This, I think, is far from trivial, but here's a possible approach you might want to consider.

First of all, a character with a diacritic does not necessarily need to be encoded as one character. For instance, the character Ẫ can be encoded as x1EAA, as x0041 x0302 x0303 (A + ^ + ~) or as x00C2 x0303 (Â + ~), where the ~ and ^ are not the characters on your keyboard, but the combining diacritical marks. The order of the marks matters: x00C3 x0302 (Ã + ^) puts the tilde first and is not canonically equivalent to the other three. Also, depending on how you look at it, Æ (x00C6) equals AE, although Unicode defines no decomposition for it, so no normalization form will split it for you.
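To make the point about multiple encodings concrete, here is a minimal sketch in Python (not XSLT; it only uses the standard unicodedata module). The variable names are made up for the illustration. It checks which spellings of Ẫ collapse to the same string under NFC, and shows that putting the tilde before the circumflex gives a different result.

```python
import unicodedata

precomposed = "\u1EAA"         # Ẫ as a single code point
fully_split = "A\u0302\u0303"  # A + combining circumflex + combining tilde
half_split = "\u00C2\u0303"    # Â + combining tilde
swapped = "\u00C3\u0302"       # Ã + combining circumflex (tilde comes first)


def nfc(s):
    """Return the NFC (composed) form of s."""
    return unicodedata.normalize("NFC", s)


# The first three spellings are canonically equivalent.
same = nfc(precomposed) == nfc(fully_split) == nfc(half_split)
print(same)

# The swapped order is a different character sequence after normalization.
swapped_matches = nfc(swapped) == precomposed
print(swapped_matches)
```

Comparing the raw strings without normalizing first would report all four as different, which is why normalization matters before any comparison.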

To normalize this Unicode stuff, the Unicode Consortium has defined four normalization forms. These either decompose (D) as much as possible, or compose (C) as much as possible. Furthermore, they can (de)compose even further to the compatibility (K) counterparts of the characters. This gives four variants: NFC, NFKC, NFD and NFKD (see http://www.unicode.org/unicode/reports/tr15). The XPath 2.0 function normalize-unicode deals with this and also adds a fifth variant: fully-normalized, which is NFC plus not starting with a combining character.
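The difference between the four forms is easiest to see on a sample that contains both a compatibility character and an accented letter. The following is a small Python sketch (again just the standard unicodedata module; the sample string and helper name are my own choice) that prints the code points each form produces.

```python
import unicodedata

# "ﬁ" ligature (a compatibility character) followed by precomposed é
sample = "\uFB01\u00E9"


def code_points(s):
    """Render a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in s)


forms = {
    name: code_points(unicodedata.normalize(name, sample))
    for name in ("NFC", "NFD", "NFKC", "NFKD")
}

for name, cps in forms.items():
    print(name, cps)
```

Only the K forms replace the ligature with plain f + i, and only the D forms leave the acute accent as a separate combining mark.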

So much for the theory. Now practice. Processors only need to support one normalization form, namely NFC. This is needed so you can correctly compare two strings. What you need is NFKD (or, to a lesser extent, NFD). Then you can use translate() to remove the combining diacritical marks (x0302, x0300, x0301, x0303, x0309, x0323, x0304 for a start: circumflex, grave, acute, tilde, hook above, dot below, macron) and you have a string without diacritical marks.
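The decompose-then-remove idea can be tried outside XSLT first. Below is a minimal Python sketch, with a function name of my own invention. Instead of listing individual marks, it drops every character in the Unicode category Mn (nonspacing mark) after NFD. In XPath 2.0 the counterpart would presumably be something like replace(normalize-unicode($s, 'NFD'), '\p{Mn}', ''), provided the processor supports NFD; I have not tested that expression.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose text with NFD, then drop all nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_diacritics("\u0101"))        # ā, the example from the question
print(strip_diacritics("\u1EAA"))        # Ẫ
print(strip_diacritics("crème brûlée"))
print(strip_diacritics("\u00C6"))        # Æ has no decomposition
```

Characters without a decomposition, such as Æ, pass through unchanged, so those would still need an explicit mapping if you want AE.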

As far as I can tell, your best bet is Saxon-SA, which seems to support NFC, NFD, NFKC and NFKD (http://www.saxonica.com/conformance/xqts100/SaxonResults.html). I'm not sure it will work the way you expect it to.

I'm not sure how you intend to deal with Chinese, Hangul, Hebrew, Cyrillic and some other more complex scripts, because there the combining characters are an important part of the letter, without which it means nothing. This depends on what you are actually after.
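To get a feeling for what happens outside the Latin script, here is a small Python sketch (the sample strings and helper name are my own choice) that applies the same decompose-and-strip step to a Cyrillic letter and a pointed Hebrew word, and shows what NFD alone does to a Hangul syllable.

```python
import unicodedata


def strip_marks(text):
    """NFD-decompose text and drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Cyrillic й (short i) decomposes to и + combining breve and loses the breve.
cyrillic = strip_marks("\u0439")

# Hebrew "shalom" with vowel points: the points are removed, consonants stay.
hebrew = strip_marks("\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD")

# Hangul 한 is split by NFD into three conjoining jamo (letters, not marks).
hangul = unicodedata.normalize("NFD", "\uD55C")

print(cyrillic, hebrew, len(hangul))
```

Whether turning й into и, a different letter, is acceptable depends entirely on the purpose, for example loose searching versus display.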

Cheers,

Abel Braaksma
http://abelleba.metacarpus.com


Jeff Sese wrote:
Hi,

Is there a way in XSLT for me to get the base character of a character with diacritic? Like ā to a? I was thinking of using the translate function, but there are too many characters to include.

-- Jeff



--~------------------------------------------------------------------
XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
or e-mail: <mailto:xsl-list-unsubscribe(_at_)lists(_dot_)mulberrytech(_dot_)com>
--~--