On 2002.01.10, at 15:18, Jarkko Hietaniemi wrote:
Be certain to pick up the latest devel snapshot from:
ftp://ftp.funet.fi/pub/languages/perl/snap/
It's changed quite a bit since 5.7.2. Not that much in the Encode
department, unfortunately (I think). Some of Sadahiro's patches
went in since 5.7.2, that much I can see.
Bad news. It's gotten worse on the latest DEVEL14150. It completely
ignores 2-byte chars. Here are the details.
I used MacOS 10.1.2 for 5.7.2 and FreeBSD 4.5-stable for DEVEL14150
(5.7.2 just didn't compile on FreeBSD; I think that's a known fact).
# first, let's see if the conventional method works
perl -MJcode -ple '$_=jcode($_,"euc")->utf8' table.euc > table.utf8
# table.euc is a euc-jp encoded text file that contains all of ASCII,
# JISX0201 (aka Hankaku Kana) and JISX0208
iconv -f euc-jp -t utf8 table.euc > iconv.utf8
iconv -f utf8 -t euc-jp table.utf8 > iconv.euc
> diff -u table.euc iconv.euc
--- table.euc Wed Nov 15 14:46:44 2000
+++ iconv.euc Thu Jan 10 19:03:58 2002
@@ -8,7 +8,7 @@
0xa0c0:
0xa0e0:
0xa1a0: 、。,.・:;?!゛゜´`¨^ ̄_ヽヾゝゞ〃仝々〆〇ー―‐
-0xa1c0: ;焉憼€繊臓叩帖邸董如函鼻福法漫諭痢蓮弌僉辧咫圈奸屐廖悄戞據曄棔檗\xDE
+0xa1c0: _〜‖|…‥‘’“”()〔〕[]{}〈〉《》「」『』【】+−±
0xa1e0: ÷=≠<>≦≧∞∴♂♀°′″℃¥$¢£%#&*@§☆★○●◎◇
0xa2a0: ◆□■△▲▽▼※〒→←↑↓〓 ∈∋⊆⊇⊂
0xa2c0: ∪∩ ∧∨¬⇒⇔∀∃ ∠⊥⌒
(Don't worry; Sadahiro-san can read it.) This difference is
acceptable. It arises because Jcode leaves the ASCII part
[\x00-\x7e] untouched, while iconv faithfully follows the conversion
table of the Unicode Consortium, which maps the "Zenkaku Backslash"
(that is, the backslash defined in JISX0208) back to the ASCII
backslash. Virtually no Japanese user likes a 2-byte char being mapped
back to ASCII, so I made Jcode leave ASCII alone. (That behavior can
be overridden by setting $Jcode::Unicode::Pedantic = 1.) In short,
both Jcode and iconv are acceptable for daily use.
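For anyone who wants to repeat this kind of round-trip check without
the full table, here is a minimal sketch that uses iconv alone in both
directions. The file name sample.euc and its two sample characters are
stand-ins I made up for illustration, not the real table.euc, and the
sketch assumes an iconv that recognizes the names EUC-JP and UTF-8.

```shell
#!/bin/sh
# Round-trip sanity check: EUC-JP -> UTF-8 -> EUC-JP should reproduce
# the input byte for byte. sample.euc is a hypothetical stand-in for
# table.euc.
tmp=$(mktemp -d) || exit 1

# ASCII text plus two hiragana in EUC-JP (bytes a4 a2 and a4 a4),
# written as octal escapes so this script itself stays pure ASCII.
printf 'abc \244\242\244\244\n' > "$tmp/sample.euc"

iconv -f EUC-JP -t UTF-8 "$tmp/sample.euc" > "$tmp/sample.utf8"
iconv -f UTF-8 -t EUC-JP "$tmp/sample.utf8" > "$tmp/roundtrip.euc"

# cmp -s is silent and only sets the exit status.
if cmp -s "$tmp/sample.euc" "$tmp/roundtrip.euc"; then
    result="round trip OK"
else
    result="round trip differs"
fi
echo "$result"

rm -rf "$tmp"
```

A sample this small will not exercise the Zenkaku Backslash case
described above; for that, the full table and diff -u are still needed.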
Now comes the Encode module of 5.7.2.
# see the previous mail for classic.pl
../classic.pl -d table.euc camel572.utf8
../classic.pl -e table.utf8 camel572.euc
Voila! diff -u table.utf8 camel572.utf8 gives me an empty string! They
are completely identical. The bad news is that encoding back to euc
produces trash. So it works only halfway.
Now DEVEL14150. Decoding worked fine, as in 5.7.2, but when you try to
encode from utf8 to euc-jp, perl croaks with:
euc-jp '[non-printable garbage]' does not map to UTF-8 at
/home/dankogai/perl/lib/5.7.2/i386-freebsd-multi-64int/Encode/Tcl.pm
line 228
Now I am tempted to implement a top-level Encode myself....
Also, 5.7.2 and its variants appear pretty unstable. Let me see if
Encode itself can work on 5.6.1 as well (it should; it's under the ext/
directory after all. A little tweak to the compile scripts would be
needed, however).
Dan the Man with Too Many Charsets to Handle