Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:
http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/DEVANAGA.TXT
##################
# Section 1: Map the following byte pairs as indicated:
# (ZWNJ means ZERO WIDTH NON-JOINER, ZWJ means ZERO WIDTH JOINER)
# (Also see note about 0xF0 in comments above)
0xA1+0xE9 0x0950 # DEVANAGARI OM
0xA6+0xE9 0x090C # DEVANAGARI LETTER VOCALIC L
0xA7+0xE9 0x0961 # DEVANAGARI LETTER VOCALIC LL
0xAA+0xE9 0x0960 # DEVANAGARI LETTER VOCALIC RR
0xDB+0xE9 0x0962 # DEVANAGARI VOWEL SIGN VOCALIC L
0xDC+0xE9 0x0963 # DEVANAGARI VOWEL SIGN VOCALIC LL
0xDF+0xE9 0x0944 # DEVANAGARI VOWEL SIGN VOCALIC RR
0xE8+0xE8 0x094D+0x200C # DEVANAGARI SIGN VIRAMA + ZWNJ # explicit halant
0xE8+0xE9 0x094D+0x200D # DEVANAGARI SIGN VIRAMA + ZWJ # soft halant
0xEA+0xE9 0x093D # DEVANAGARI SIGN AVAGRAHA
# Section 2: Map the remaining bytes as follows:
[snip]
0xA1 0x0901 # DEVANAGARI SIGN CANDRABINDU
....
0xA6 0x0907 # DEVANAGARI LETTER I
0xA7 0x0908 # DEVANAGARI LETTER II
....
0xAA 0x090B # DEVANAGARI LETTER VOCALIC R
0xA6 0x0907 # DEVANAGARI LETTER I
...
0xDB 0x093F # DEVANAGARI VOWEL SIGN I
0xDC 0x0940 # DEVANAGARI VOWEL SIGN II
0xDD 0x0941 # DEVANAGARI VOWEL SIGN U
0xDE 0x0942 # DEVANAGARI VOWEL SIGN UU
0xDF 0x0943 # DEVANAGARI VOWEL SIGN VOCALIC R
....
0xE8 0x094D # DEVANAGARI SIGN VIRAMA # halant
....
0xEA 0x0964 # DEVANAGARI DANDA
#
Let me tell you what we have to do when we receive 0xA1. We consult
Section 1, and if 0xA1 plus the following byte matches an entry there,
we use it. If not, we map 0xA1 by itself from Section 2 and treat the
next byte as just an ordinary character. In other words, 0xA1 has to be
BOTH an END POINT of the page traversal and THE POINTER TO the next
page. The current encengine is not designed that way.
Er, not 100% sure about that. I certainly considered that case at one
point (encengine came from trie code I had used elsewhere for keyword
matching - the original could match (say) both 'lst' and 'ls' by
backtracking if the next thing was not a 't').
It is likely that enc2xs cannot build such a table, though.
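The longest-match-with-fallback lookup discussed above can be sketched in
Python. This is only an illustration of the idea, not encengine or enc2xs
code: the names PAIRS, SINGLES and decode are made up for this sketch, and
the tables hold just a handful of the entries quoted from DEVANAGA.TXT.

```python
# Hypothetical excerpt of Section 1 (byte pairs) from the quoted table.
PAIRS = {
    (0xA1, 0xE9): "\u0950",        # DEVANAGARI OM
    (0xE8, 0xE8): "\u094D\u200C",  # VIRAMA + ZWNJ (explicit halant)
    (0xE8, 0xE9): "\u094D\u200D",  # VIRAMA + ZWJ (soft halant)
    (0xEA, 0xE9): "\u093D",        # AVAGRAHA
}

# Hypothetical excerpt of Section 2 (single bytes) from the quoted table.
SINGLES = {
    0xA1: "\u0901",  # CANDRABINDU
    0xA6: "\u0907",  # LETTER I
    0xE8: "\u094D",  # VIRAMA
    0xEA: "\u0964",  # DANDA
}


def decode(data: bytes) -> str:
    """Try the two-byte entry first; fall back to the one-byte entry."""
    out = []
    i = 0
    while i < len(data):
        pair = tuple(data[i:i + 2])
        if len(pair) == 2 and pair in PAIRS:
            # Section 1 match: consume both bytes.
            out.append(PAIRS[pair])
            i += 2
        elif data[i] in SINGLES:
            # No pair match: "backtrack" and map the first byte alone.
            out.append(SINGLES[data[i]])
            i += 1
        else:
            raise ValueError(f"unmapped byte 0x{data[i]:02X} at offset {i}")
    return "".join(out)
```

Here 0xA1 is both an end point (when followed by 0xA6) and a prefix (when
followed by 0xE9), which is the case in question.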
It must be EITHER.
One easy way to overcome this is to make a mock double-byte map for 0xA1
and the other such bytes, where the second-level page lists every
possible following byte. Since MacDevanagari is originally a single-byte
encoding, this is still possible without bloating the UCM.
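A rough Python sketch of how such a mock double-byte table could be
generated follows. The names and the tiny table excerpts are assumptions
made for illustration; this does not produce real UCM output. Note two
things the sketch leaves open: a lead byte at the very end of the input has
no entry, and a sequence such as 0xE8 0xA1 0xE9 would be split as
(0xE8 0xA1)(0xE9) rather than (0xE8)(0xA1 0xE9).

```python
# Hypothetical excerpts of the quoted Section 1 and Section 2 entries.
PAIRS = {
    (0xA1, 0xE9): "\u0950",        # DEVANAGARI OM
    (0xE8, 0xE8): "\u094D\u200C",  # VIRAMA + ZWNJ
    (0xE8, 0xE9): "\u094D\u200D",  # VIRAMA + ZWJ
}
SINGLES = {
    0xA1: "\u0901",  # CANDRABINDU
    0xA6: "\u0907",  # LETTER I
    0xE8: "\u094D",  # VIRAMA
}


def expand(pairs, singles):
    """Turn every Section 1 lead byte into a pure two-byte prefix."""
    leads = {lead for lead, _ in pairs}
    table = {}
    # Bytes that never start a pair stay single-byte entries.
    for byte, text in singles.items():
        if byte not in leads:
            table[bytes([byte])] = text
    # For each lead byte, enumerate every known second byte.
    for lead in leads:
        for second, text in singles.items():
            key = (lead, second)
            # Real pair if Section 1 has one, else the two singles joined.
            table[bytes(key)] = pairs.get(key, singles[lead] + text)
    # Keep pairs whose second byte is missing from this excerpt of SINGLES.
    for key, text in pairs.items():
        table.setdefault(bytes(key), text)
    return table


table = expand(PAIRS, SINGLES)
```

With this layout each lead byte is only ever a pointer to the next page,
never an end point, at the cost of one extra page per lead byte.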
Dan the Encode Maintainer
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/