perl-unicode

Re: Encode::XS for CJK

2002-02-21 12:34:58
On Thu, Jan 31, 2002 at 04:19:23AM +0900, Dan Kogai wrote:
  And the speed of the compile script may be a problem if we want all 
CJK to be XS-based.  It roughly takes about 25 seconds to compile single 
CJK encoding on my FreeBSD box.  Well, I can live with that too but 
other porters may find it frustrating....

Now that I've re-read this message, I've just noticed that paragraph.
I did get frustrated with it.
1: It's too slow
2: It uses too much RAM. (Well, that's subjective, but my FreeBSD box only
   has 16M total, and it was not a happy bunny, swapping like crazy and taking
   over an hour to run 5 minutes worth of CPU time)

So I've been re-jigging it (and Jarkko has been committing the improvements
to bleadperl) - not sure if you're subscribed to p5p.
By yesterday I think it was 37% faster at compiling EUC_JP, and I've found
some more things to tweak today.

[e.g. I just found that using (unpack "n*", pack "H*", $line) makes it 2.5%
faster than (map {hex $_} $line =~ /(....)/g).
I think that it is portable to big-endian and to 64-bit platforms.]
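
To illustrate the two conversions side by side, here is a minimal sketch. The
sample $line is made up, and it assumes the line consists only of complete
4-hex-digit groups. The "n" template reads 16-bit values in big-endian
("network") order whatever the host byte order is, which is why the result
should not depend on the machine.

```perl
use strict;
use warnings;

# Hypothetical sample line: four 4-digit hex values run together.
my $line = '00410042FFFD3042';

# Original approach: cut into 4-character groups, convert each with hex().
my @via_hex = map { hex $_ } $line =~ /(....)/g;

# Alternative: pack the hex digits into bytes, then read them back as
# unsigned 16-bit big-endian integers.
my @via_unpack = unpack "n*", pack "H*", $line;

print "@via_hex\n";     # 65 66 65533 12354
print "@via_unpack\n";  # 65 66 65533 12354
```

The two only agree when the input is clean hex in multiples of four digits;
anything else (a trailing newline, an odd digit count) would need checking
separately.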

I hope that I've not been tramping on things you've been doing. It's still
making output files that are byte-for-byte identical with what the original of
last week did.

I've got a question about FFFD. The original compile script does this:

     for (my $j = 0; $j < 16; $j++)
      {
       no strict 'refs';
       my $ech = &{"encode_$type"}($ch,$page);
       my $val = hex(substr($line,0,4,''));
       next if $val == 0xFFFD;
       if ($val || (!$ch && !$page))
        {
         my $el  = length($ech);
         $max_el = $el if (!defined($max_el) || $el > $max_el);
         $min_el = $el if (!defined($min_el) || $el < $min_el);
         my $uch = encode_U($val);
         if (exists $seen{$uch})
          {
           warn sprintf("U%04X is %02X%02X and %02X%02X\n",
                        $val,$page,$ch,@{$seen{$uch}});
          }
         else
          {
           $seen{$uch} = [$page,$ch];
          }
         enter($e2u,$ech,$uch,$e2u,0);
         enter($u2e,$uch,$ech,$u2e,0);
        }
       else
        {
         # No character at this position
         # enter($e2u,$ech,undef,$e2u);
        }
       $ch++;
      }


Is there a bug?
Should the $ch++ happen even for the cases where $val == 0xFFFD?
Currently it looks like $ch is not incremented when the input value is 0xFFFD,
because the "next" skips the rest of the loop body.
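
A cut-down sketch of the control flow in question, not the compile script
itself: the @row data and the variable names @skipping, @advancing and $ch2
are invented for illustration. In a C-style for loop, "next" still runs the
$j++ step but bypasses everything left in the body, including a trailing
$ch++.

```perl
use strict;
use warnings;

# Hypothetical row of four table entries; 0xFFFD marks an unmapped slot.
my @row = (0x0041, 0xFFFD, 0x0042, 0x0043);

# Same shape as the loop above: "next" comes before $ch++.
my $ch = 0;
my @skipping;
for (my $j = 0; $j < @row; $j++) {
    my $val = $row[$j];
    next if $val == 0xFFFD;    # jumps to $j++, so $ch++ below is not run
    push @skipping, sprintf("%02X=>U+%04X", $ch, $val);
    $ch++;
}

# Variant where the byte counter advances on every iteration, by moving
# the increment into the loop header alongside $j++.
my $ch2 = 0;
my @advancing;
for (my $j = 0; $j < @row; $j++, $ch2++) {
    my $val = $row[$j];
    next if $val == 0xFFFD;
    push @advancing, sprintf("%02X=>U+%04X", $ch2, $val);
}

print "@skipping\n";   # 00=>U+0041 01=>U+0042 02=>U+0043
print "@advancing\n";  # 00=>U+0041 02=>U+0042 03=>U+0043
```

In the first loop every entry after a 0xFFFD is assigned a byte value one
lower than its position in the row; in the second the byte value always
tracks the position. Which of the two the table format intends is exactly
the question.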

Nicholas Clark
-- 
EMCFT http://www.ccl4.org/~nick/CV.html
