perl-unicode

Re: Encode::XS for CJK

2002-01-31 01:14:13
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:
First, thank you all for perl@14503.

On 2002.01.30, at 07:07, Nick Ing-Simmons wrote:
If I run the compile script on it and build Encode::EUC_JP
as an XS extension and change Encode::Tcl to ....

  I also made Encode::JP::SHIFTJIS, with Encode::EUC_JP as a template
(and also relocated Encode::EUC_JP to Encode::JP::EUC_JP), and it also
worked.  I have a feeling this will work for other CJK.
  Now the problem is escape-based codings such as ISO-2022.

Can you explain the way those work?
I can imagine two ways for decode:
A. Keep going with the current sub-encoding till we get a fail,
   then look at the next few octets for an escape sequence.
B. Scan ahead for the next escape sequence (or end of available input),
   then translate up to that.

A. is easy, but as all escape sequences seem to be valid ASCII,
   decoding never fails on them, so it does not work.
B. requires an irritating double scan.
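
A minimal sketch of option B in Python, assuming a cut-down ISO-2022-JP
with only two designations from RFC 1468: ESC ( B for ASCII and ESC $ B
for JIS X 0208. The names decode_segments and decode_jis0208 are
hypothetical, and the JIS X 0208 lookup borrows Python's euc_jp codec
instead of a compiled table. Scanning for ESC is safe here because JIS
X 0208 bytes fall in 0x21-0x7E and so never contain 0x1B.

```python
import re

# Escape sequence -> sub-encoding it selects (subset of RFC 1468).
ESCAPES = {
    b"\x1b(B": "ascii",
    b"\x1b$B": "jisx0208",
}
ESC_RE = re.compile(rb"\x1b(?:\(B|\$B)")


def decode_jis0208(chunk):
    # Hypothetical helper: 7-bit JIS X 0208 pairs become EUC-JP by
    # setting the high bit of every byte, then Python's codec maps them.
    return bytes(b | 0x80 for b in chunk).decode("euc_jp")


def decode_segments(data):
    out = []
    state = "ascii"            # streams start in ASCII
    pos = 0
    while pos < len(data):
        # First scan: find the next escape (or end of available input).
        m = ESC_RE.search(data, pos)
        end = m.start() if m else len(data)
        chunk = data[pos:end]
        # Second scan: translate everything up to that point.
        if chunk:
            if state == "jisx0208":
                out.append(decode_jis0208(chunk))
            else:
                out.append(chunk.decode("ascii"))
        if m:
            state = ESCAPES[m.group()]
            pos = m.end()
        else:
            pos = end
    return "".join(out)
```

The double scan is visible as the regex search followed by the
per-segment decode; a C version would presumably fold the two into one
pass over a state machine.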

For encode there is a different pain. For each code point we need an
efficient way to find out whether a sub-encoding can represent that
point. A bit map of 0x10FFFF entries does not seem good, so it is
either an auxiliary table, or try-it-and-see (which should not be too bad
with a C version).
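
The try-it-and-see idea can be sketched as follows. This is only an
illustration in Python under the same two-charset assumption (ASCII and
JIS X 0208); try_encode and encode_try_and_see are hypothetical names,
and Python's euc_jp codec stands in for the compiled tables.

```python
# (escape sequence, sub-encoding name); index 0 is the initial state.
SUBSETS = [
    (b"\x1b(B", "ascii"),
    (b"\x1b$B", "jisx0208"),
]


def try_encode(ch, name):
    """Return the bytes for ch in sub-encoding name, or None."""
    if name == "ascii":
        return ch.encode("ascii") if ord(ch) < 0x80 else None
    try:
        raw = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return None
    # Only two-byte EUC-JP sequences with both bytes >= 0xA1 are
    # JIS X 0208; clear the high bits to get the 7-bit form.
    if len(raw) == 2 and raw[0] >= 0xA1 and raw[1] >= 0xA1:
        return bytes(b & 0x7F for b in raw)
    return None


def encode_try_and_see(text):
    out = bytearray()
    current = 0
    for ch in text:
        # Try the current sub-encoding first, then the others.
        order = [current] + [i for i in range(len(SUBSETS)) if i != current]
        for i in order:
            raw = try_encode(ch, SUBSETS[i][1])
            if raw is not None:
                if i != current:
                    out += SUBSETS[i][0]   # emit the switching escape
                    current = i
                out += raw
                break
        else:
            raise ValueError(f"cannot represent U+{ord(ch):04X}")
    if current != 0:
        out += SUBSETS[0][0]               # return to ASCII at the end
    return bytes(out)
```

Trying the current sub-encoding first keeps the number of escapes down
for runs of text in one script, at the cost of one failed lookup at each
switch.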


  Another small problem is that XS-based encoding consumes a whole
directory immediately under perl/ext/Encode.  Well, I can live with a
few dozen more.

You could bundle several encodings in one XS (the way Encode itself
bundles ASCII, iso-8859-* and koi8).
If any of the bundled encodings have similar sequences of code points
then we will get overall table size reductions too.

In the limit one could have Encode::CJK, but perhaps
Encode::JP / Encode::CN / Encode::KR makes more sense?

  And the speed of the compile script may be a problem if we want all
CJK to be XS-based.  It takes about 25 seconds to compile a single
CJK encoding on my FreeBSD box.  Well, I can live with that too but
other porters may find it frustrating....

We could ship things pre-compiled (with original .ucm's gzipped, or
provide a way to extract a .ucm from the compiled form).
Also the compile process is all in perl and has not really been tuned.
It spends a lot of time trying to find common "strings" (which gets tables
down in size so is a win.)
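
To illustrate why finding common "strings" shrinks tables, here is a
small Python sketch. It is not how the compile script works (I have not
checked its search strategy); pack_tables is a hypothetical name and the
greedy longest-first placement is an assumption. Each row is stored once
in a shared blob, and a row that already occurs inside the blob just
reuses that offset.

```python
def pack_tables(rows):
    """Pack byte strings into one blob; offsets[i] locates rows[i]."""
    blob = bytearray()
    # Place the longest rows first so shorter ones can be found
    # inside them instead of being stored again.
    for row in sorted(set(rows), key=len, reverse=True):
        if blob.find(row) < 0:
            blob += row
    offsets = [blob.find(row) for row in rows]
    return bytes(blob), offsets


rows = [b"ABCD", b"BCD", b"ABCD", b"XY"]
blob, offsets = pack_tables(rows)
# 13 bytes of rows fit in a 6-byte blob.
```

The repeated substring searches are also where the time goes, which is
consistent with the compile step being slow for large CJK tables.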

  I think we are making significant progress in CJK....

Dan
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/

