Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:
First, thank you all for perl@14503.
On 2002.01.30, at 07:07, Nick Ing-Simmons wrote:
If I run the compile script on it and build Encode::EUC_JP
as an XS extension and change Encode::Tcl to ....
I also made Encode::JP::SHIFTJIS, with Encode::EUC_JP as a template
(Also relocated Encode::EUC_JP to Encode::JP::EUC_JP) and it also
worked. I have a feeling this will work for other CJK.
Now the problem is escape-based codings such as ISO-2022.
Can you explain the way those work?
I can imagine two ways for decode:
A. Keep going with the current sub-encoding till we get a fail,
then look at the next few octets for an escape sequence.
B. Scan ahead for the next escape sequence (or end of available input)
then translate up to that.
A is easy, but as all the escape sequences seem to be valid ASCII it
does not work.
B requires an irritating double scan.
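To make option B concrete, here is a rough sketch in Python (not the Encode implementation, just an illustration). It assumes a small subset of the ISO-2022-JP designations, treats JIS X 0201 Roman as plain ASCII, and decodes JIS X 0208 runs by setting the high bit and reusing Python's euc_jp codec. The names DESIGNATIONS, decode_segment and decode_iso2022jp are made up for this example.

```python
# Sketch of approach B: scan ahead to the next ESC, translate the run before
# it with the current sub-encoding, then switch sub-encoding.
# Assumption: only these four designations are handled.
DESIGNATIONS = {
    b"(B": "ascii",     # ESC ( B  -> ASCII
    b"(J": "ascii",     # ESC ( J  -> JIS X 0201 Roman, approximated as ASCII
    b"$@": "jisx0208",  # ESC $ @  -> JIS X 0208-1978
    b"$B": "jisx0208",  # ESC $ B  -> JIS X 0208-1983
}


def decode_segment(chunk, charset):
    """Translate one escape-free run of octets."""
    if charset == "jisx0208":
        # 7-bit JIS X 0208 octets become EUC-JP when the high bit is set.
        return bytes(b | 0x80 for b in chunk).decode("euc_jp")
    return chunk.decode("ascii")


def decode_iso2022jp(data):
    out = []
    charset = "ascii"  # ISO-2022-JP starts in ASCII
    pos = 0
    while pos < len(data):
        # First scan: find the next escape (or end of available input).
        nxt = data.find(b"\x1b", pos)
        if nxt == -1:
            nxt = len(data)
        # Second scan: translate everything up to that point.
        if nxt > pos:
            out.append(decode_segment(data[pos:nxt], charset))
        if nxt < len(data):
            seq = data[nxt + 1:nxt + 3]
            if seq not in DESIGNATIONS:
                raise ValueError("unsupported escape sequence at %d" % nxt)
            charset = DESIGNATIONS[seq]
            nxt += 3  # skip ESC plus the two designation octets
        pos = nxt
    return "".join(out)
```

The double scan shows up as the find() call followed by decode_segment() over the same octets.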
For encode there is a different pain. For each code point we need an
efficient way to find out whether a sub-encoding can represent that
point. A bit map of 0x10FFFF entries does not seem good, so it is
either an auxiliary table, or try-it-and-see (which should not be too bad
with a C version).
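A try-it-and-see encoder could look roughly like the Python sketch below. It is only an illustration under some assumptions: two sub-encodings (ASCII and JIS X 0208), with JIS X 0208 coverage probed through Python's euc_jp codec rather than a real table. All the function names here are hypothetical.

```python
def try_ascii(ch):
    """Return the octets for ch in ASCII, or None if not representable."""
    try:
        return ch.encode("ascii")
    except UnicodeEncodeError:
        return None


def try_jisx0208(ch):
    """Return the 7-bit JIS X 0208 octets for ch, or None."""
    try:
        raw = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return None
    # Only two-octet EUC-JP codes with both octets >= 0xA1 are JIS X 0208.
    if len(raw) == 2 and raw[0] >= 0xA1 and raw[1] >= 0xA1:
        return bytes(b & 0x7F for b in raw)
    return None


# (designating escape, probe function); index 0 is the initial state.
SUB_ENCODINGS = [
    (b"\x1b(B", try_ascii),
    (b"\x1b$B", try_jisx0208),
]


def encode_iso2022jp(text):
    out = bytearray()
    current = 0
    for ch in text:
        # Try the current sub-encoding first so no escape is needed.
        order = [current] + [i for i in range(len(SUB_ENCODINGS)) if i != current]
        for i in order:
            raw = SUB_ENCODINGS[i][1](ch)
            if raw is not None:
                if i != current:
                    out += SUB_ENCODINGS[i][0]  # emit escape on switch
                    current = i
                out += raw
                break
        else:
            raise ValueError("no sub-encoding can represent %r" % ch)
    if current != 0:
        out += SUB_ENCODINGS[0][0]  # return to ASCII at end of text
    return bytes(out)
```

Probing the current sub-encoding first means a miss only costs extra lookups at the points where an escape would be emitted anyway.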
Another small problem is that each XS-based encoding consumes a whole
directory immediately under perl/ext/Encode. Well, I can live with a
few dozen more.
You could bundle several encodings in one XS (the way Encode itself
bundles ASCII, iso-8859-* and koi8).
If any of the bundled encodings have similar sequences of code points
then we will get overall table size reductions too.
In the limit one could have Encode::CJK, but perhaps
Encode::JP / Encode::CN / Encode::KR makes more sense ???
And the speed of the compile script may be a problem if we want all
CJK to be XS-based. It takes roughly 25 seconds to compile a single
CJK encoding on my FreeBSD box. Well, I can live with that too but
other porters may find it frustrating....
We could ship things pre-compiled (with original .ucm's gzipped, or
provide a way to extract a .ucm from the compiled form).
Also the compile process is all in perl and has not really been tuned.
It spends a lot of time trying to find common "strings" (which gets tables
down in size so is a win.)
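The common-"strings" idea can be pictured with the small Python sketch below. It is not what the compile script actually does, just an assumed simplification: each table row is stored in one shared blob, and a row that already occurs somewhere in the blob is referenced by offset instead of being stored again. pack_rows is a made-up name.

```python
def pack_rows(rows):
    """Pack byte-string rows into one blob, reusing existing substrings.

    Returns (blob, offsets), where rows[i] == blob[offsets[i]:offsets[i] + len(rows[i])].
    """
    blob = bytearray()
    offsets = []
    for row in rows:
        pos = blob.find(row)   # the substring search is where the time goes
        if pos == -1:
            pos = len(blob)    # not seen yet: append it
            blob += row
        offsets.append(pos)
    return bytes(blob), offsets
```

For example, packing b"abcd", b"bc", b"abcd", b"xyz" gives a 7-octet blob instead of 13 octets. Bundling encodings with similar runs of code points into one XS gives the search more chances to hit.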
I think we are making significant progress in CJK....
Dan
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/