perl-unicode

Re: use Encode; # on Japanese; LONG!

2002-01-10 03:50:30

On 2002.01.10, at 15:18, Jarkko Hietaniemi wrote:
Be certain to pick up the latest devel snapshot from:

ftp://ftp.funet.fi/pub/languages/perl/snap/

It's changed quite a bit since 5.7.2.  Not that much in the Encode
department, unfortunately (I think).  Some of Sadahiro's patches
went in since 5.7.2, that much I can see.

Bad news. It's gotten worse on the latest DEVEL14150. It completely ignores 2-byte chars. Here is the detailed research.

I used Mac OS X 10.1.2 for 5.7.2 and FreeBSD 4.5-stable for DEVEL14150 (5.7.2 just didn't compile on FreeBSD; I think it's a known fact).

# first let's see if the conventional method works
# (double quotes around "euc" so the shell does not eat the inner quotes)
perl -MJcode -ple '$_=jcode($_,"euc")->utf8' table.euc > table.utf8
# table.euc is a euc-jp encoded text that contains all ascii, JISX0201
# (aka Hankaku Kana) and JISX0208
iconv -f euc-jp -t utf8 table.euc  > iconv.utf8
iconv -f utf8 -t euc-jp table.utf8 > iconv.euc

> diff -u table.euc iconv.euc
--- table.euc   Wed Nov 15 14:46:44 2000
+++ iconv.euc   Thu Jan 10 19:03:58 2002
@@ -8,7 +8,7 @@
 0xa0c0:
 0xa0e0:
 0xa1a0:    、。,.・:;?!゛゜´`¨^ ̄_ヽヾゝゞ〃仝々〆〇ー―‐
-0xa1c0: ;焉憼€繊臓叩帖邸董如函鼻福法漫諭痢蓮弌僉辧咫圈奸屐廖悄戞據曄棔檗\xDE
+0xa1c0: _〜‖|…‥‘’“”()〔〕[]{}〈〉《》「」『』【】+−±
 0xa1e0: ÷=≠<>≦≧∞∴♂♀°′″℃¥$¢£%#&*@§☆★○●◎◇
 0xa2a0:   ◆□■△▲▽▼※〒→←↑↓〓                      ∈∋⊆⊇⊂
 0xa2c0: ∪∩                ∧∨¬⇒⇔∀∃                      ∠⊥⌒

(Don't worry; Sadahiro-san can read it.) This difference is acceptable. It arises because Jcode leaves the ASCII range [\x00-\x7e] untouched, while iconv faithfully follows the Unicode Consortium's conversion table, which maps "Zenkaku Backslash" (that is, the backslash that is mapped in JISX0208) back to the ASCII backslash. Virtually no Japanese user likes having a 2-byte char mapped back to ASCII, so I made Jcode leave ASCII alone. That behavior can be overridden by setting $Jcode::Unicode::Pedantic = 1. In short, both Jcode and iconv are acceptable for daily use.
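
For anyone who wants to repeat this kind of round-trip check on a smaller input, here is a minimal sketch. It assumes an iconv that knows both EUC-JP and UTF-8. The helper name roundtrip_check and the file sample.euc are made up for illustration and are not part of the test above.

```shell
# Hypothetical helper: convert an EUC-JP file to UTF-8 and back,
# then compare the result byte-for-byte with the original.
roundtrip_check() {
    src=$1
    tmp=$(mktemp -d) || return 2
    iconv -f EUC-JP -t UTF-8 "$src" > "$tmp/utf8" &&
    iconv -f UTF-8 -t EUC-JP "$tmp/utf8" > "$tmp/euc" &&
    cmp -s "$src" "$tmp/euc"
    status=$?          # status of the whole conversion-and-compare chain
    rm -rf "$tmp"
    return $status
}

# Tiny sample: ASCII "A", hiragana A (0xA4 0xA2), and
# JISX0201 Hankaku Kana A (0x8E 0xB1), written as octal escapes.
printf 'A\244\242\216\261\n' > sample.euc

if roundtrip_check sample.euc; then
    echo "round trip OK"
else
    echo "round trip differs"
fi
```

A full table such as table.euc can be passed the same way. When the check fails, diff -u on the two files shows where the mappings disagree.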

Now comes the Encode module of 5.7.2.

# see the previous mail for classic.pl
../classic.pl -d table.euc camel572.utf8
../classic.pl -e table.utf8 camel572.euc

Voila! diff -u table.utf8 camel572.utf8 gives me an empty string! They are completely identical. The bad news is that encoding back to euc produces trash. So it worked only halfway.

Now DEVEL14150. Decoding worked fine, as in 5.7.2, but when you try to encode from utf8 to euc-jp, perl croaks with:

euc-jp '[non-printable garbage]' does not map to UTF-8 at /home/dankogai/perl/lib/5.7.2/i386-freebsd-multi-64int/Encode/Tcl.pm line 228

Now I am tempted to implement toplevel Encode myself....

Also, 5.7.2 and its variants appear pretty unstable. Let me see if Encode itself can work on 5.6.1 as well (it should; it's under the ext/ directory after all. A little tweak to the compile script would be needed, however).

Dan the Man with Too Many Charsets to Handle
