Perl Encode Hackers,
I have been annoyed that Encode::JP does not yet support
JISX0212-1990. Though this charset is rarely used, it is an official
part of euc-jp today, as well as of iso-2022-jp. Without it, euc-jp
support is hardly complete.
Desperately I reviewed Nick's compile implementation once again and
found that there is no reason compile cannot handle 3-byte codes. "Oh
man!" I shouted, because JISX0212 in euc-jp is represented as 3 bytes:
0x8F followed by the two bytes of (jisx0212 | 0x8080).
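For the record, the byte arithmetic can be sketched in a few lines of Python. This is only an illustration, not code from Encode or compile; the function name is made up, and it assumes the input is a 16-bit JIS X 0212 code whose two bytes each lie in 0x21..0x7E. The high bit of both bytes is set (a bitwise OR with 0x8080) and 0x8F is put in front.

```python
def jisx0212_to_eucjp(code: int) -> bytes:
    """Hypothetical helper: 16-bit JIS X 0212 code -> 3-byte EUC-JP sequence."""
    hi, lo = code >> 8, code & 0xFF
    # Assumption: both bytes of a valid code are in the 94x94 range 0x21..0x7E.
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError(f"not a JIS X 0212 code: {code:#06x}")
    word = code | 0x8080          # set the high bit of each byte
    return bytes([0x8F, word >> 8, word & 0xFF])


print(jisx0212_to_eucjp(0x2242).hex())
```

With 0x2242 as input this yields the bytes 8F A2 C2, the same sequence that appears for <U00A1> in the ucm diff.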
I have created a ucm file called euc-jp+0212.ucm that looks like this:
> diff -u Encode/euc-jp.ucm Encode/euc-jp+0212.ucm | less
--- Encode/euc-jp.ucm Tue Mar 12 04:56:36 2002
+++ Encode/euc-jp+0212.ucm Tue Mar 19 17:51:32 2002
@@ -1,7 +1,7 @@
# compile -o Encode/euc-jp.ucm Encode/euc-jp.enc
<code_set_name> "euc-jp"
<mb_cur_min> 1
-<mb_cur_max> 2
+<mb_cur_max> 3
<subchar> \x3F
#
CHARMAP
@@ -210,7 +210,6 @@
<UFF9D> \x8E\xDD |0 # HALFWIDTH KATAKANA LETTER N
<UFF9E> \x8E\xDE |0 # HALFWIDTH KATAKANA VOICED SOUND MARK
<UFF9F> \x8E\xDF |0 # HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
-<U008F> \x8F |0 # <control>
<U0090> \x90 |0 # <control>
<U0091> \x91 |0 # <control>
<U0092> \x92 |0 # <control>
@@ -7106,4 +7105,6149 @@
<U7464> \xF4\xA4 |0 # CJK Ideograph
<U51DC> \xF4\xA5 |0 # CJK Ideograph
<U7199> \xF4\xA6 |0 # CJK Ideograph
+<U00A1> \x8F\xA2\xC2 |0 # CJK Ideograph
....
That is,
* <mb_cur_max> is now 3, instead of 2.
* \x8F is no longer a control character, but the first byte of the
3-byte representation of jisx0212.
* The rest of the table I grabbed from Jcode (Jcode/Unicode/table.h),
and I modified JP/Makefile.PL so it uses the new table. Voila! It worked!
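A quick sanity check one might run on such a table is sketched below in Python. It is not part of Encode; the function names are hypothetical, and the parser assumes only the simple CHARMAP line layout visible in the diff (<Uxxxx>, escaped bytes, then a |N fallback flag). It reports the longest byte sequence in the table, which should not exceed <mb_cur_max>.

```python
import re

# Assumed line layout, taken from the diff: <Uxxxx> \xHH[\xHH...] |N # comment
ENTRY = re.compile(r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|(\d)")


def parse_entry(line):
    """Return (codepoint, bytes, fallback flag), or None for non-entry lines."""
    m = ENTRY.match(line)
    if m is None:
        return None
    raw = bytes(int(h, 16) for h in re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2)))
    return int(m.group(1), 16), raw, int(m.group(3))


def longest_sequence(lines):
    """Length in bytes of the longest encoded sequence among the entries."""
    entries = (parse_entry(line) for line in lines)
    return max((len(e[1]) for e in entries if e is not None), default=0)


sample = [
    r"<mb_cur_max> 3",
    r"<UFF9F> \x8E\xDF |0 # HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK",
    r"<U0090> \x90 |0 # <control>",
    r"<U00A1> \x8F\xA2\xC2 |0 # CJK Ideograph",
]
print(longest_sequence(sample))
```

Header lines such as <mb_cur_max> do not match the pattern and are skipped, so the same loop can be fed the whole file.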
Since Encode/JP/JIS.pm and Encode/JP/ISO_2022_JP are already coded to
handle jisx0212 (if euc-jp supports it), this automagically adds
jisx0212 support to the other encodings as well.
I still need to fix the pod and t/JP.t so that it tests the 0212 part,
but I will upload a new Encode package within 24 hours.
Thank you Nick for making compile this smart!
Dan the Man with a New Encoding