perl-unicode

Re: roundtrip conversion for Mac OS CJK encodings

2003-09-28 06:30:07
Nick Ing-Simmons <nick(_at_)ing-simmons(_dot_)net> wrote:

> SADAHIRO Tomoyuki <bqw10602(_at_)nifty(_dot_)com> writes:
>> Hello.
>>
>> For round-trip fidelity, the Mac OS CJK encodings include many mappings
>> from a single character in a Mac OS encoding
>> to a sequence of standard Unicode characters.
>> (cf. ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/README.TXT )
>>
>> In Encode.pm, such characters are marked with |3
>> ("reverse fallback": used only from the encoding to Unicode, not back),
>> so round-trip conversion is not achieved.
>
> I think I copied those markings from ICU. I am not 100% sure that fallbacks
> are "compiled" correctly, and I am not an expert on CJK stuff.
> If it would be more "perlish" to make the round-trip conversion work
> by default, Encode.pm can be less pedantic than ICU and allow it.

Never mind.
Actually, I don't work with a Macintosh, so I'm not sure
what people who use a Mac (the machine itself and/or its encodings) would want.

I would like to give an example:
when MacKorean is handled via Unicode, the following behavior
is not inconsistent, though a user might find it strange.

#!perl
use encoding 'MacKorean';                     # (1)
our $string = "\xAA\x45\xB4\xEB\xAA\x8E";     # (2)
$string =~ s/\xB4\xEB//g;                     # (3)
print $string;                                # (4)
__END__

<RESULT>
"\x{20de}" does not map to MacKorean.
"\x{20dd}" does not map to MacKorean.
\x{20de}\x{20dd}

cf. the relevant part of macKorean.ucm:
<UB300>        \xB4\xEB |0 # hangul DAE
<UB300><U20DD> \xAA\x8E |3 # hangul DAE + COMBINING ENCLOSING CIRCLE
<UB300><U20DE> \xAA\x45 |3 # hangul DAE + COMBINING ENCLOSING SQUARE

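At line (2), $string consists of three Korean characters:
"\xAA\x45", "\xB4\xEB", and "\xAA\x8E".
Someone who expects (3) to remove only the character "\xB4\xEB"
would expect the output at (4) to be "\xAA\x45\xAA\x8E".
But after decoding, the pattern in (3) is U+B300, which also matches
the first code point of each two-code-point sequence, so only the
combining marks U+20DE and U+20DD are left.
This "problem" would not be resolved just by making the round-trip work.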
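
To make the effect concrete without depending on Encode's compiled tables,
here is a minimal sketch. The three-entry table and the helpers toy_decode()
and codepoints() are hypothetical, written only for this illustration; the
mappings themselves are the three macKorean.ucm lines quoted above.

```perl
#!perl
use strict;
use warnings;

# Hypothetical three-entry table copied from the macKorean.ucm lines above.
# A real converter would use the full table through Encode.
my %to_unicode = (
    "\xB4\xEB" => "\x{B300}",            # hangul DAE
    "\xAA\x8E" => "\x{B300}\x{20DD}",    # DAE + COMBINING ENCLOSING CIRCLE
    "\xAA\x45" => "\x{B300}\x{20DE}",    # DAE + COMBINING ENCLOSING SQUARE
);

# Toy decoder: assumes every character in the sample is two bytes long.
sub toy_decode {
    my ($bytes) = @_;
    my $out = '';
    for my $pair ($bytes =~ /(..)/gs) {
        die "unmapped byte pair\n" unless exists $to_unicode{$pair};
        $out .= $to_unicode{$pair};
    }
    return $out;
}

# Render a string as a list of code points, so nothing wide is printed.
sub codepoints {
    my ($text) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $text;
}

my $string = toy_decode("\xAA\x45\xB4\xEB\xAA\x8E");
(my $after = $string) =~ s/\x{B300}//g;    # same substitution as (3)

print codepoints($string), "\n";   # U+B300 U+20DE U+B300 U+B300 U+20DD
print codepoints($after),  "\n";   # U+20DE U+20DD
```

The substitution removes all three U+B300, not only the one that came from
"\xB4\xEB", which is why only the two combining marks remain.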

Possible solutions might be the following.
(I don't think any of them would be very practical, though.)

(a) mapping *every* character in an encoding to a single
    Unicode character (e.g. in the private use areas).

(b) grapheme-aware operations that distinguish \x{B300}\x{20DD}
    from \x{B300} as a grapheme.
* But \X is insufficient; such operations must also cope with the hint
  characters in the PUA
  (http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CORPCHAR.TXT)
  including:
0xF860  # transcoding hint: group next 2 characters # Japanese,Korean
0xF861  # transcoding hint: group next 3 characters # Japanese,Korean
0xF862  # transcoding hint: group next 4 characters # Japanese,Korean

Then /\x{F860}\p{Any}{2}/, /\x{F861}\p{Any}{3}/, /\x{F862}\p{Any}{4}/,
etc. would each match a single grapheme for Macintosh encodings.

(c) multibyte-aware operations without conversion to Unicode
    (something like Jperl in the old days).

(d) giving up Macintosh encodings...
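
As a rough sketch of what option (b) could look like: the pattern below
treats a transcoding hint plus the characters it groups as one unit and
falls back to \X otherwise. The names $mac_grapheme and remove_grapheme()
and the sample string "\x{F860}AB" are hypothetical, made up for this
illustration; only the three hint characters listed above are handled.

```perl
#!perl
use strict;
use warnings;

# One "Mac grapheme": a transcoding hint together with the 2, 3 or 4
# characters it groups, or else an ordinary grapheme cluster (\X).
my $mac_grapheme = qr/
      \x{F860}\p{Any}{2}
    | \x{F861}\p{Any}{3}
    | \x{F862}\p{Any}{4}
    | \X
/x;

# Hypothetical helper: delete only the graphemes that equal $target exactly.
sub remove_grapheme {
    my ($text, $target) = @_;
    my @kept = grep { $_ ne $target } ($text =~ /($mac_grapheme)/g);
    return join '', @kept;
}

# The decoded form of "\xAA\x45\xB4\xEB\xAA\x8E" from the example above.
my $text   = "\x{B300}\x{20DE}\x{B300}\x{B300}\x{20DD}";
my $result = remove_grapheme($text, "\x{B300}");

# A hint-grouped sequence stays together as one unit.
my @units = ("\x{F860}AB\x{B300}" =~ /($mac_grapheme)/g);

print join(' ', map { sprintf 'U+%04X', ord($_) } split //, $result), "\n";
# U+B300 U+20DE U+B300 U+20DD
print scalar(@units), "\n";    # 2
```

Here only the bare U+B300 in the middle is removed, and the two enclosed
forms keep their base character. This only splits a string into units;
general regex operations would still need the same awareness.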

regards,
SADAHIRO Tomoyuki
