Re: use Encode; # on Japanese; LONG!


On 2002.01.10, at 15:18, Jarkko Hietaniemi wrote:

Be certain to pick up the latest devel snapshot from:

ftp://ftp.funet.fi/pub/languages/perl/snap/

It's changed quite a bit since 5.7.2.  Not that much in the Encode
department, unfortunately (I think).  Some of Sadahiro's patches
went in since 5.7.2, that much I can see.

Bad news. It's gotten worse on the latest DEVEL14150: it completely
ignores 2-byte chars. Here is the detailed research.

I used MacOS 10.1.2 for 5.7.2 and FreeBSD 4.5-stable for DEVEL14150
(5.7.2 just didn't compile on FreeBSD; I think that's a known fact).


# first let's see if conventional method works
perl -MJcode -ple '$_=jcode($_,"euc")->utf8' table.euc > table.utf8
# table.euc is a euc-jp encoded text that contains all ascii, JISX0201
# (aka Hankaku Kana) and JISX0208
iconv -f euc-jp -t utf8 table.euc  > iconv.utf8
iconv -f utf8 -t euc-jp table.utf8 > iconv.euc

> diff -u table.euc iconv.euc
--- table.euc   Wed Nov 15 14:46:44 2000
+++ iconv.euc   Thu Jan 10 19:03:58 2002
@@ -8,7 +8,7 @@
 0xa0c0:
 0xa0e0:
 0xa1a0:   　、。，．・：；？！゛゜´｀¨＾￣＿ヽヾゝゞ〃仝々〆〇ー―‐
-0xa1c0: ；焉憼€繊臓叩帖邸董如函鼻福法漫諭痢蓮弌僉辧咫圈奸屐廖悄戞據曄棔檗\xDE
+0xa1c0: _〜‖｜…‥‘’“”（）〔〕［］｛｝〈〉《》「」『』【】＋−±
 0xa1e0: ÷＝≠＜＞≦≧∞∴♂♀°′″℃￥＄¢£％＃＆＊＠§☆★○●◎◇
 0xa2a0:   ◆□■△▲▽▼※〒→←↑↓〓                      ∈∋⊆⊇⊂
 0xa2c0: ∪∩                ∧∨¬⇒⇔∀∃                      ∠⊥⌒

(Don't worry; Sadahiro-san can read it.) This difference is
acceptable. It comes from the fact that Jcode leaves the ASCII range
[\x00-\x7e] untouched, while iconv faithfully follows the Unicode
Consortium's conversion table, which maps the "Zenkaku Backslash"
(that is, the backslash assigned in JIS0208) back to the ASCII
backslash. Virtually no Japanese user likes a 2-byte char being mapped
back to ASCII, so I made Jcode leave ASCII alone. (That behavior can
be overridden by setting $Jcode::Unicode::Pedantic = 1.) In short,
both Jcode and iconv are acceptable for daily use.
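
To see which way a given converter treats the Zenkaku Backslash, one
can dump the code points it produces. The following is only a sketch:
it assumes the decode() interface of the Encode module as released
with later perls (not necessarily the snapshot discussed here), and
euc_to_codepoints is a name made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return the Unicode code points ("U+XXXX")
# that Encode assigns to a string of EUC-JP bytes.
sub euc_to_codepoints {
    my ($euc_bytes) = @_;
    my $chars = decode('euc-jp', $euc_bytes);   # bytes -> characters
    return map { sprintf('U+%04X', ord($_)) } split(//, $chars);
}

# 0xA1 0xC0 is the 2-byte (Zenkaku) backslash; 0x5C is the ASCII one.
print join(' ', euc_to_codepoints("\xA1\xC0")), "\n";
print join(' ', euc_to_codepoints("\x5C")), "\n";
```

If the first line prints U+005C, the converter folds the 2-byte
backslash into ASCII; anything else means it keeps the two apart.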


Now comes the Encode module of 5.7.2.

# see the previous mail for classic.pl
../classic.pl -d table.euc camel572.utf8
../classic.pl -e table.utf8 camel572.euc

Voila! diff -u table.utf8 camel572.utf8 gives me empty output! They
are completely identical. The bad news is that encoding back to euc
produces trash. So it works only halfway.
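
The same decode-then-encode check can be done inside perl instead of
with diff. This is a minimal sketch, again assuming the decode() and
encode() functions of a later released Encode; euc_roundtrip_ok and
the sample bytes are made up for illustration and are not taken from
table.euc.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: true if the EUC-JP bytes survive
# decode -> encode unchanged.
sub euc_roundtrip_ok {
    my ($euc_bytes) = @_;
    my $chars = decode('euc-jp', $euc_bytes);   # bytes -> characters
    my $back  = encode('euc-jp', $chars);       # characters -> bytes
    return $back eq $euc_bytes;
}

# Sample: ASCII, one Hankaku Kana (0x8E 0xB1), one JISX0208 char (0xA4 0xA2).
my $sample = "abc\x8E\xB1\xA4\xA2";
print euc_roundtrip_ok($sample) ? "round trip ok\n" : "round trip FAILED\n";
```

Running this over each line of a table file would point at the first
line where the encode direction goes wrong.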

Now DEVEL14150. Decode worked fine, as in 5.7.2, but when you try to
encode from utf8 to euc-jp, perl croaks with:

euc-jp '[non-printable garbage]' does not map to UTF-8 at /home/dankogai/perl/lib/5.7.2/i386-freebsd-multi-64int/Encode/Tcl.pm line 228
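
For comparison, a croak of this kind can be trapped rather than
killing the whole run. The sketch below assumes the Encode::FB_CROAK
check flag of a later released Encode, not the Encode::Tcl code path
that produced the message above; try_encode_euc is a name made up for
illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: try to encode characters as EUC-JP.
# Returns (bytes, undef) on success, or (undef, error message)
# if some character has no mapping.
sub try_encode_euc {
    my ($chars) = @_;
    my $copy  = $chars;    # FB_CROAK may modify its argument in place
    my $bytes = eval { encode('euc-jp', $copy, Encode::FB_CROAK()) };
    return defined $bytes ? ($bytes, undef) : (undef, $@);
}

# HIRAGANA LETTER A followed by a character EUC-JP cannot represent.
my ($bytes, $err) = try_encode_euc("\x{3042}\x{1F600}");
print defined $err ? "encode failed: $err" : "encoded ok\n";
```

The error text names the offending character, which helps tell a real
mapping gap from a broken table.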


Now I am tempted to implement toplevel Encode myself....

Also, 5.7.2 and its variants appear pretty unstable. Let me see if
Encode itself can work on 5.6.1 as well (it should; it's under the
ext/ directory after all. A little tweak to the compile script would
be needed, however).


Dan the Man with Too Many Charsets to Handle