perl-unicode

Re: Encode::Tcl Mystery Solved!

2002-01-29 15:07:59
Nick Ing-Simmons <nick(_dot_)ing-simmons(_at_)elixent(_dot_)com> writes:
Encode::Tcl is too slow - even for 8-bit - which is why I wrote the
engine which works from the "compiled" form.

Have you tried using ext/Encode/compile to build an XS module for
EUC ?

The example above on my FreeBSD box, Pentium III 800 MHz and
512MB RAM took some two seconds to show the result (Its performance is
not too bad once the internal table is full).

If I had _ANY_ test data I would run the compiled test and give you
the comparative number.

  You can use t/table.euc from the Jcode module, for instance.  table.utf8
in my code example is just a UTF-8 version of it. That data contains all
characters defined in EUC (well, actually JISX0212 is not included, but
very few environments can display JISX0212).

It is really great to have some valid data!

For a start it has found a bug in the :encoding layer - I knew there must be
some... (I think I have rediscovered the multi-byte char spanning buffer
boundary bug ... which I could not reproduce before)
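
For anyone who has not met this class of bug: when input is decoded one buffer at a time, a multi-byte character can be cut in half at the end of a buffer, and the leftover bytes have to be carried into the next read. The following is only a minimal sketch of that carry-over, using the FB_QUIET idiom from the Encode documentation; it is not the :encoding layer's own code. The sample string and the 3-byte chunk size are assumptions chosen so that a boundary lands mid-character.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample data (an assumption for this sketch): three 2-byte EUC-JP characters.
my $bytes = encode('euc-jp', "\x{65E5}\x{672C}\x{8A9E}");

# Read through an in-memory filehandle in 3-byte chunks so that a chunk
# boundary falls in the middle of a character.
open(my $fh, '<', \$bytes) or die "Cannot open in-memory file: $!";

my ($buffer, $string) = ('', '');
while (read($fh, $buffer, 3, length($buffer)))
 {
  # FB_QUIET returns what could be converted and leaves the unconverted
  # tail (here, half a character) in $buffer for the next pass.
  $string .= decode('euc-jp', $buffer, Encode::FB_QUIET);
 }
close($fh);

die "Undecoded bytes left over" if length $buffer;
print "chunked decode matches whole-string decode\n"
  if $string eq decode('euc-jp', $bytes);
```

A layer that drops or mis-handles that leftover tail at a buffer boundary would produce exactly the kind of bug described above.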

But avoiding that with this script:

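use strict;
use warnings;
use Encode;
use Encode::Tcl;

# Read the raw EUC-JP bytes without any :encoding layer.
open(my $jp,"<","table.euc") || die "Cannot open table.euc:$!";
my $text = join('',<$jp>);
close($jp);
my $enc  = find_encoding('euc-jp');
if ($enc)
 {
  # With a true CHECK argument, decode() removes what it consumed from $text.
  my $uni = $enc->decode($text,1);
  if (length $text)
   {
    # Anything left over was not converted.
    die "Failed to translate";
   }
  open(my $un,">:utf8","table.utf8") || die "Cannot open table.utf8:$!";
  print $un $uni;
  close($un);
 }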

I get 

nick(_at_)bactrian 624$ time ../../perl -I../../lib try2
 
real    0m1.389s
user    0m1.370s
sys     0m0.020s
nick(_at_)bactrian 624$             

And the file is byte-for-byte identical to the output of Linux iconv.

If I run the compile script on it, build Encode::EUC_JP
as an XS extension, and change the "use Encode::Tcl;" line to:

use Encode::EUC_JP;

I get 

nick(_at_)bactrian 626$ time ../../perl -I../../lib try2
real    0m0.197s
user    0m0.170s
sys     0m0.030s
nick(_at_)bactrian 626$             

Which is still worse than: 

nick(_at_)bactrian 626$ time iconv -f EUC-JP -t UTF-8 table.euc > expected
 
real    0m0.026s
user    0m0.010s
sys     0m0.020s
nick(_at_)bactrian 627$    

But IO is sub-optimal.
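
To see how the time divides between conversion and IO, the two phases can be timed separately inside the script. This is only a rough sketch: it uses Time::HiRes, and an in-memory sample stands in for table.euc (both the sample and the repeat count are assumptions), so the numbers it prints are not comparable with the timings above.

```perl
use strict;
use warnings;
use Encode qw(find_encoding encode);
use Time::HiRes qw(time);

# Hypothetical stand-in for table.euc: a repeated EUC-JP sample built in memory.
my $sample = encode('euc-jp', "\x{65E5}\x{672C}\x{8A9E}\n") x 10_000;

my $enc = find_encoding('euc-jp') or die "euc-jp encoding not available";

my $t0  = time;
my $uni = $enc->decode($sample);   # no CHECK argument, so $sample is left untouched
my $t1  = time;

# Write through a :utf8 layer to an in-memory file to time the output side.
my $out = '';
open(my $fh, '>:utf8', \$out) or die "Cannot open in-memory file: $!";
print $fh $uni;
close($fh);
my $t2  = time;

printf "decode: %.4fs  write: %.4fs\n", $t1 - $t0, $t2 - $t1;
```

Pointing the same timing calls at the real read, decode and write steps of the script above would show which of them dominates.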

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/
