Nick Ing-Simmons <nick(_dot_)ing-simmons(_at_)elixent(_dot_)com> wrote:
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:
First, thank you all for perl@14503.
On 2002.01.30, at 07:07, Nick Ing-Simmons wrote:
If I run the compile script on it and build Encode::EUC_JP
as an XS extension and change Encode::Tcl to ....
I also made Encode::JP::SHIFTJIS, using Encode::EUC_JP as a template
(and relocated Encode::EUC_JP to Encode::JP::EUC_JP), and it also
worked. I have a feeling this will work for the other CJK encodings.
Now the problem is escape-based codings such as ISO-2022.
Can you explain the way those work?
I can imagine two ways for decode:
A. Keep going with the current sub-encoding till we get a fail,
then look at the next few octets for an escape sequence.
B. Scan ahead for the next escape sequence (or end of available input),
then translate up to that.
A is easy, but since all escape sequences seem to be valid ASCII, the
current sub-encoding never fails on them, so it does not work.
B. requires an irritating double scan.
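To make B concrete, here is a minimal sketch in Python rather than the XS/C discussed in this thread. It assumes only two sub-encodings, ASCII (selected by ESC ( B) and JIS X 0208 (selected by ESC $ B), and it borrows Python's euc_jp codec to translate the JIS X 0208 runs. The names ESCAPES, decode_run and decode_scan_ahead are hypothetical, not part of Encode.

```python
import re

# Assumed table: escape sequence -> sub-encoding it selects.
ESCAPES = {
    b"\x1b(B": "ascii",     # ESC ( B
    b"\x1b$B": "jisx0208",  # ESC $ B
}
ESC_RE = re.compile(b"|".join(re.escape(e) for e in ESCAPES))

def decode_run(chunk, charset):
    """Translate one run of octets that contains no escape sequence."""
    if charset == "ascii":
        return chunk.decode("ascii")
    # A JIS X 0208 pair becomes its EUC-JP form when the high bit is set
    # on both octets, so the stock euc_jp codec can do the lookup.
    return bytes(b | 0x80 for b in chunk).decode("euc_jp")

def decode_scan_ahead(data):
    out = []
    charset = "ascii"
    pos = 0
    while pos < len(data):
        # First scan: find the next escape sequence (or end of input).
        m = ESC_RE.search(data, pos)
        end = m.start() if m else len(data)
        # Second scan: translate everything before it in one call.
        out.append(decode_run(data[pos:end], charset))
        if m is None:
            break
        charset = ESCAPES[m.group()]
        pos = m.end()
    return "".join(out)
```

The two passes over each run are the "double scan". The sketch also shows why A cannot work: ESC, '$' and 'B' each decode as valid ASCII, so the ASCII sub-encoding would consume the escape sequence without ever reporting a failure.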
Encode::Tcl does neither A nor B (though B would be better).
If the next byte is an "escape",
it invokes a new sub-encoding (CCS);
otherwise
it decodes the byte and converts it to Unicode.
The escape octets for Encode::Tcl::Escape are ESC, SI, and SO;
those for Encode::Tcl::Extend are SS2 ("\x8E") and SS3 ("\x8F");
that for Encode::Tcl::HanZi is '~', the tilde.
One could instead try to translate the whole octet sequence
up to the next "escape" octet.
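As an illustration of the octet-by-octet approach described above (not Encode::Tcl's actual code), here is a small Python sketch. It assumes just two single-octet sub-encodings switched by SO and SI: ASCII, and 7-bit JIS X 0201 katakana, whose octets 0x21..0x5F correspond to U+FF61..U+FF9F. All function and table names are made up for the example.

```python
SO, SI = 0x0E, 0x0F

def decode_ascii(octet):
    return chr(octet)

def decode_katakana(octet):
    # 7-bit JIS X 0201 katakana: 0x21..0x5F -> U+FF61..U+FF9F.
    if 0x21 <= octet <= 0x5F:
        return chr(0xFF40 + octet)
    if octet <= 0x20:
        return chr(octet)  # controls and space pass through unchanged
    raise ValueError("octet 0x%02X is not defined in this set" % octet)

# Assumed table: "escape" octet -> sub-encoding it invokes.
SHIFTS = {SO: decode_katakana, SI: decode_ascii}

def decode_octets(data):
    current = decode_ascii
    out = []
    for octet in data:
        if octet in SHIFTS:
            # An "escape" octet: invoke a new sub-encoding.
            current = SHIFTS[octet]
        else:
            # Anything else: decode with the current sub-encoding.
            out.append(current(octet))
    return "".join(out)
```

For example, decode_octets(b"A\x0e\x31\x0fB") yields "A", then U+FF71 (halfwidth katakana A), then "B". Every octet pays for a table lookup, which is the cost that translating a whole run up to the next escape would avoid.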
For encode there is a different pain. For each code point we need an
efficient way to find out whether a sub-encoding can represent that
point. A bit map of 0x10FFFF entries does not seem good, so it is
either an auxiliary table, or try-it-and-see (which should not be too bad
with a C version).
Encode::Tcl tries and sees.
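A rough Python sketch of try-it-and-see, again assuming only ASCII and JIS X 0208 as sub-encodings and leaning on Python's euc_jp codec for the lookup. This is not how Encode::Tcl is written; CANDIDATES, try_encode and encode_try_and_see are hypothetical names.

```python
# Assumed candidates, in order of preference: (escape sequence, sub-encoding).
CANDIDATES = [
    (b"\x1b(B", "ascii"),
    (b"\x1b$B", "jisx0208"),
]

def try_encode(ch, charset):
    """Return the octets for ch in charset, or None if it has none."""
    try:
        if charset == "ascii":
            return ch.encode("ascii")
        euc = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return None
    # Only two-octet EUC-JP forms with both octets >= 0xA1 are JIS X 0208;
    # clearing the high bits gives the 7-bit form used after ESC $ B.
    if len(euc) == 2 and euc[0] >= 0xA1 and euc[1] >= 0xA1:
        return bytes(b & 0x7F for b in euc)
    return None

def encode_try_and_see(text):
    out = bytearray()
    current = "ascii"
    for ch in text:
        # Try the sub-encoding already in effect first.
        octets = try_encode(ch, current)
        if octets is None:
            # It failed, so try each candidate and switch to the first hit.
            for esc, charset in CANDIDATES:
                octets = try_encode(ch, charset)
                if octets is not None:
                    out += esc
                    current = charset
                    break
            else:
                raise ValueError("cannot represent U+%04X" % ord(ch))
        out += octets
    if current != "ascii":
        out += b"\x1b(B"  # return to ASCII at the end of the text
    return bytes(out)
```

Trying the current sub-encoding first means an escape sequence is emitted only when the character really cannot be represented there, so runs of the same script cost one successful lookup per character.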
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/
SADAHIRO Tomoyuki