perl-unicode

Re: use Encode; # on Japanese; LONG!

2002-01-10 17:49:48

On Thu, 10 Jan 2002 19:50:10 +0900
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> wrote:

Bad news.  It's gotten worse on the latest DEVEL14150.  It completely 
ignores 2byte chars.  Here is the detailed research.

I used MacOS 10.1.2 for 5.7.2 and FreeBSD 4.5-stable for DEVEL14150 
(5.7.2 just didn't compile on FreeBSD; I think it's a known fact).

# first let's see if conventional method works
perl -MJcode -ple '$_=jcode($_,"euc")->utf8' table.euc > table.utf8
# table.euc is a euc-jp encoded text that contains all ascii, JISX0201
# (aka Hankaku Kana) and JISX0208


Now comes the Encode module of 5.7.2.

# see the previous mail for classic.pl
../classic.pl -d table.euc camel572.utf8
../classic.pl -e table.utf8 camel572.euc

Voila!  diff -u table.utf8 camel572.utf8 gives me empty output!  They 
are completely identical.  The bad news is that encoding back to euc 
produces garbage.  So it works only halfway.

Now DEVEL14150.  Decoding worked fine, as in 5.7.2, but when you try to 
encode from utf8 to euc-jp, perl croaks with:

euc-jp '[non-printable garbage]' does not map to UTF-8 at 
/home/dankogai/perl/lib/5.7.2/i386-freebsd-multi-64int/Encode/Tcl.pm 
line 228

I guess the SVf_UTF8 flag is off on that string.
This is probably because the input was not read through the UTF-8 layer.
(That said, the "euc-jp ... does not map to UTF-8" error message
 ought to be shown when decoding to Unicode, not when encoding.)
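To illustrate what I mean by the flag being off, here is a minimal sketch.
It assumes a later perl (5.8.1 or newer) where utf8::is_utf8() and the
XS-based Encode with 'euc-jp' are available; it is not what Encode::Tcl does
internally, and the sample character (U+3042) is just my choice.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for HIRAGANA LETTER A (U+3042), as read from a file
# opened without any layer. The UTF8 flag is off on this string.
my $bytes = "\xE3\x81\x82";

# Decoding turns the bytes into a character string with the flag on.
my $chars = decode('UTF-8', $bytes);

for my $pair ([bytes => $bytes], [chars => $chars]) {
    my ($name, $str) = @$pair;
    printf "%s: flag=%s length=%d\n",
        $name, (utf8::is_utf8($str) ? 'on' : 'off'), length $str;
}

# Only the decoded character string should be handed to the encoder.
my $euc = encode('euc-jp', $chars);
printf "euc-jp bytes: %s\n", unpack('H*', $euc);
```

The byte string has length 3 and the character string has length 1, which
is the difference the encoder sees.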

Please refer to the PerlIO manpage for details;
we declare that the stream carries a Unicode (UTF-8) sequence
like this: binmode(FILEHANDLE, ":utf8");
or through the open() function.
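A rough sketch of the layer approach follows. The binmode() call is the one
mentioned above; the ':encoding(euc-jp)' output layer, the temporary file
names and the sample data are my assumptions (they need a perl with
PerlIO::encoding, i.e. 5.8 or later), and this is not how classic.pl works.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical file names inside a scratch directory.
my $dir       = tempdir(CLEANUP => 1);
my $utf8_file = "$dir/sample.utf8";
my $euc_file  = "$dir/sample.euc";

# Prepare a small UTF-8 input file: U+3042 followed by a newline.
open(my $raw, '>:raw', $utf8_file) or die "open $utf8_file: $!";
print {$raw} "\xE3\x81\x82\n";
close($raw) or die "close $utf8_file: $!";

# Declare that the input stream carries UTF-8.
open(my $in, '<', $utf8_file) or die "open $utf8_file: $!";
binmode($in, ':utf8');

# Let the output layer do the conversion to euc-jp.
open(my $out, '>:encoding(euc-jp)', $euc_file) or die "open $euc_file: $!";
while (my $line = <$in>) {
    print {$out} $line;
}
close($in)  or die "close $utf8_file: $!";
close($out) or die "close $euc_file: $!";

# Read the result back as raw bytes to inspect it.
open(my $check, '<:raw', $euc_file) or die "open $euc_file: $!";
my $euc_bytes = do { local $/; <$check> };
close($check) or die "close $euc_file: $!";
printf "euc-jp bytes: %s\n", unpack('H*', $euc_bytes);
```

With the layers in place the script itself never calls an encode or decode
function; the strings it handles are character strings throughout.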

Bleadperl has *many, many* docs on Unicode:
perluniintro, perlunicode, lib/utf8, etc.

I'd be glad if this helps you. At

http://homepage1.nifty.com/nomenclator/perl/unicode.htm
(in Japanese)

there is a brief overview of Perl's Unicode support, including
a short comparison of the differences
between Perl 5.7 and 5.6.

Now I am tempted to implement toplevel Encode myself....

Also, 5.7.2 and its variants appear pretty unstable.  Let me see if 
Encode itself can work on 5.6.1 as well (it should; it's under the ext/ 
directory after all.  A little tweak to the compile scripts would be 
needed, however).

Dan the Man with Too Many Charsets to Handle

Encode::Tcl should work on Perl 5.6 since it is pure Perl;
however, it is very slow, as you pointed out,
and therefore not very practical to use.
There is much room for improvement.

Regards,
SADAHIRO Tomoyuki
URL: http://homepage1.nifty.com/nomenclator/
