Re: Encode::XS for CJK


Nick Ing-Simmons <nick(_dot_)ing-simmons(_at_)elixent(_dot_)com> wrote:

Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:

First, thank you all for perl@14503.

On 2002.01.30, at 07:07, Nick Ing-Simmons wrote:

If I run the compile script on it and build Encode::EUC_JP
as an XS extension and change Encode::Tcl to ....


  I also made Encode::JP::SHIFTJIS, with Encode::EUC_JP as a template
(Also relocated Encode::EUC_JP to Encode::JP::EUC_JP) and it also
worked.  I have a feeling this will work for other CJK.
  Now the problem is escape-based codings such as ISO-2022.


Can you explain the way those work?
I can imagine two ways for decode:
A. Keep going with the current sub-encoding until we get a failure,
   then look at the next few octets for an escape sequence.
B. Scan ahead for the next escape sequence (or the end of available input),
   then translate up to that point.

A is easy, but since all escape sequences seem to be valid ASCII, the
   sub-encoding never fails on them, so it does not work.
B requires an irritating double scan.
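
To illustrate why approach A never gets its "fail", here is a small Python sketch (Python is used only for illustration; it is not the Encode code). It assumes ISO-2022-JP as the escape-based coding and uses Python's built-in "iso2022_jp" codec to produce sample input; the helper name ascii_decoder_fails is hypothetical.

```python
# ESC $ B designates JIS X 0208 in ISO-2022-JP.
ESC_TO_JISX0208 = b"\x1b$B"


def ascii_decoder_fails(octets):
    """Return True if a plain ASCII sub-decoder rejects the octets."""
    try:
        octets.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False


# "abc" followed by two kanji, encoded as 7-bit ISO-2022-JP.
stream = "abc\u65e5\u672c".encode("iso2022_jp")

# The stream contains an escape sequence, yet every octet is valid
# ASCII, so an ASCII sub-decoder runs to the end without failing.
print(ESC_TO_JISX0208 in stream)
print(ascii_decoder_fails(stream))
```

Because the whole stream stays in the 7-bit range, waiting for a decoding failure gives no signal that the sub-encoding changed, which is why the escape octets have to be looked for explicitly.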


Encode::Tcl is non-A non-B (but B would be better).

If the next byte is an "escape",
   then it invokes a new sub-encoding (CCS);
else
   it decodes the byte and converts it to Unicode.

The escape octets for Encode::Tcl::Escape are ESC, SI, and SO;
those for Encode::Tcl::Extend are SS2 ("\x8E") and SS3 ("\x8F");
that for Encode::Tcl::HanZi is '~', the tilde.

One could try to translate any octet sequence
up to the next "escape" octet in one go.
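
The escape-driven loop described above, combined with translating everything up to the next escape octet, might be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Encode::Tcl implementation: it handles only ESC (not SI, SO, SS2, SS3 or '~'), it knows only three ISO-2022-JP designations, and it decodes JIS X 0208 by setting the high bit of each octet and reusing Python's "euc_jp" codec. The names DESIGNATIONS, decode_segment and decode_escape_based are hypothetical.

```python
ESC = 0x1B

# Two octets following ESC, mapped to a sub-encoding (CCS) label.
DESIGNATIONS = {
    b"(B": "ascii",     # ESC ( B : ASCII
    b"$B": "jisx0208",  # ESC $ B : JIS X 0208-1983
    b"$@": "jisx0208",  # ESC $ @ : JIS C 6226-1978 (treated the same here)
}


def decode_segment(ccs, chunk):
    """Translate one escape-free run of octets with the current CCS."""
    if ccs == "ascii":
        return chunk.decode("ascii")
    # 7-bit JIS X 0208 octets become EUC-JP when the high bit is set.
    return bytes(b | 0x80 for b in chunk).decode("euc_jp")


def decode_escape_based(data):
    out = []
    ccs = "ascii"  # assumed initial state
    pos = 0
    while pos < len(data):
        if data[pos] == ESC:
            seq = data[pos + 1:pos + 3]
            if seq not in DESIGNATIONS:
                raise ValueError("unsupported escape sequence: %r" % seq)
            ccs = DESIGNATIONS[seq]  # invoke the new sub-encoding
            pos += 3
            continue
        # Scan ahead to the next escape octet (or the end of input)
        # and translate the whole run at once.
        nxt = data.find(b"\x1b", pos)
        if nxt == -1:
            nxt = len(data)
        out.append(decode_segment(ccs, data[pos:nxt]))
        pos = nxt
    return "".join(out)


text = "Perl \u65e5\u672c\u8a9e ok"
sample = text.encode("iso2022_jp")
print(decode_escape_based(sample) == text)
```

The scan for the next ESC is the "double scan" mentioned earlier: each run is walked once to find its end and once more by the sub-decoder.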

For encode there is a different pain. For each code point we need an
efficient way to find out whether a sub-encoding can represent that
point. A bit map of 0x10FFFF entries does not seem good, so it is
either an auxiliary table, or try-it-and-see (which should not be too bad
with the C version).


Encode::Tcl tries and sees.
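
A try-it-and-see encoder could look roughly like the Python sketch below. It is an illustration only, not the Encode::Tcl code, and it makes several assumptions: just two sub-encodings (ASCII and JIS X 0208, with ISO-2022-JP escape sequences), ASCII as the initial and final state, and JIS X 0208 membership tested by encoding with Python's "euc_jp" codec and accepting only two-octet results in the 0xA1-0xFE range. The names SUB_ENCODINGS, try_encode and encode_escape_based are hypothetical.

```python
# (escape sequence, sub-encoding label); index 0 is the assumed default.
SUB_ENCODINGS = [
    (b"\x1b(B", "ascii"),
    (b"\x1b$B", "jisx0208"),
]


def try_encode(ccs, ch):
    """Return the octets for ch in the given CCS, or None if it fails."""
    if ccs == "ascii":
        try:
            return ch.encode("ascii")
        except UnicodeEncodeError:
            return None
    try:
        raw = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return None
    # Only two-octet EUC-JP results with both octets >= 0xA1 are JIS X 0208.
    if len(raw) == 2 and raw[0] >= 0xA1 and raw[1] >= 0xA1:
        return bytes(b & 0x7F for b in raw)
    return None


def encode_escape_based(text):
    out = bytearray()
    current = 0
    for ch in text:
        # Try the current sub-encoding first, then the others in order.
        order = [current] + [i for i in range(len(SUB_ENCODINGS)) if i != current]
        for i in order:
            octets = try_encode(SUB_ENCODINGS[i][1], ch)
            if octets is not None:
                if i != current:
                    out += SUB_ENCODINGS[i][0]  # emit the escape sequence
                    current = i
                out += octets
                break
        else:
            raise ValueError("no sub-encoding can represent %r" % ch)
    if current != 0:
        out += SUB_ENCODINGS[0][0]  # return to ASCII at the end
    return bytes(out)


text = "Perl \u65e5\u672c\u8a9e ok"
encoded = encode_escape_based(text)
print(encoded.decode("iso2022_jp") == text)
```

Trying the current sub-encoding first keeps the number of emitted escape sequences down, and no per-code-point table is needed; the cost is one failed attempt whenever the sub-encoding has to change.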

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/


SADAHIRO Tomoyuki
