Re: UCM file and combining character sequences

Hank Tt <news2002nov(_at_)lomaji(_dot_)com> writes:

Hi,

I'm trying to make a UCM file to feed to enc2xs.  The legacy encoding for
Taiwanese romanization *must* have its code points mapped to Unicode
character sequences, for the simple reason that the UCS lacks the
corresponding precomposed characters (and is unlikely to have them in the
future, as they are composable using existing characters from the Latin
script and the Combining Diacritical Marks block).  (See [1] for script
details.)

Now, IBM's ICU pages document the mapping of one Unicode code point to
one legacy code point, as well as one-to-many, but apparently not
many-to-one or many-to-many:

" In the CHARMAP section of a .ucm file, each line contains a Unicode code
point (like <U{1-6 hexadecimal digits for the code point}> ), a codepage
character byte sequence (each byte like \x{2 hexadecimal digits} ).... " [2]

How does enc2xs deal with (or intend to deal with) such a case?


It may not in its current form.

The underlying C code engine is an octet-sequence to octet-sequence
converter, so provided the source encoding is unambiguous (decodable
without lookahead) the mapping can be represented. Whether ucm can
handle it is less clear, but I don't see why not: it too has two chunks
of octets per line. What may need some work is the table building, so
that the reverse mapping turns base-char+mark back into one encoded thing.
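
To make the reverse-mapping point concrete, here is a rough sketch. It is
written in Python for brevity and is not how enc2xs or the C engine is
actually implemented. The byte values in the table are made up for
illustration and are not the real Taiwanese romanization code page, and
the sketch works on characters rather than on UTF-8 octets. Decoding needs
no lookahead (one legacy byte gives one sequence), while encoding has to
try the longest sequence first so that base-char+mark collapses to one byte.

```python
# Hypothetical table: one legacy byte -> a sequence of Unicode characters.
# Byte values are invented for this example.
DECODE = {
    0x61: "a",
    0x6F: "o",
    0x80: "a\u030D",  # a + COMBINING VERTICAL LINE ABOVE
    0x81: "o\u030D",  # o + COMBINING VERTICAL LINE ABOVE
}

# Reverse table, keyed by the whole character sequence.
ENCODE = {seq: byte for byte, seq in DECODE.items()}
MAX_LEN = max(len(seq) for seq in ENCODE)


def decode(data: bytes) -> str:
    """Legacy -> Unicode: each byte maps to one sequence, no lookahead."""
    return "".join(DECODE[b] for b in data)


def encode(text: str) -> bytes:
    """Unicode -> legacy: greedy longest match over the reverse table."""
    out = bytearray()
    i = 0
    while i < len(text):
        # Try the longest candidate first so base+mark wins over base alone.
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in ENCODE:
                out.append(ENCODE[chunk])
                i += n
                break
        else:
            raise ValueError(f"no mapping at index {i}: {text[i]!r}")
    return bytes(out)
```

With this table, encode("a\u030Da") gives the two bytes 0x80 0x61, and
decode() of those bytes gives the original three characters back.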

Is the ICU specification to be followed rigidly?


No, we follow it pragmatically, but we may not yet be handling all that
ICU can express.


Since I am very new to Perl, any insight is appreciated.

[1] http://lomaji.com/poj/chart.html
[2] http://oss.software.ibm.com/icu/userguide/conversion-data.html

--Henry H. Tan-Tenn