perl-unicode

Re: Encode::XS for CJK

2002-01-31 08:13:58

Nick Ing-Simmons <nick(_dot_)ing-simmons(_at_)elixent(_dot_)com> wrote:

Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:
First, thank you all for perl(_at_)14503(_dot_)

On 2002.01.30, at 07:07, Nick Ing-Simmons wrote:
If I run the compile script on it and build Encode::EUC_JP
as an XS extension and change Encode::Tcl to ....

  I also made Encode::JP::SHIFTJIS, with Encode::EUC_JP as a template
(Also relocated Encode::EUC_JP to Encode::JP::EUC_JP) and it also
worked.  I have a feeling this will work for other CJK.
  Now the problem is escape-based codings such as ISO-2022.

Can you explain the way those work?
I can imagine two ways for decode:
A. Keep going with the current sub-encoding till we get a fail,
   then look at the next few octets for an escape sequence.
B. Scan ahead for the next escape sequence (or the end of available
   input), then translate up to that.

A. is easy, but as all escape sequences seem to be valid ASCII, the
   current sub-encoding never fails on them, so it does not work.
B. requires an irritating double scan.
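To make option B concrete, here is a minimal sketch in Python (not the Perl/XS code under discussion). It assumes only two ISO-2022-JP escape sequences, ESC ( B for ASCII and ESC $ B for JIS X 0208, and it borrows Python's euc_jp codec as a stand-in for the JIS X 0208 table by setting the high bit of each octet. The names ESCAPES, decode_segment and decode_scan_ahead are made up for illustration.

```python
import re

# Assumed subset of ISO-2022-JP: escape sequence -> sub-encoding it selects.
ESCAPES = {
    b"\x1b(B": "ascii",
    b"\x1b$B": "jisx0208",
}
ESC_RE = re.compile(rb"\x1b(?:\(B|\$B)")


def decode_segment(octets, charset):
    """Translate one escape-free run of octets with a single sub-encoding."""
    if charset == "jisx0208":
        # JIS X 0208 in 7-bit form becomes EUC-JP when the high bit is set.
        return bytes(b | 0x80 for b in octets).decode("euc_jp")
    return octets.decode("ascii")


def decode_scan_ahead(data):
    """Option B: find the next escape, translate everything before it."""
    out = []
    charset = "ascii"          # ISO-2022-JP starts in ASCII
    pos = 0
    for match in ESC_RE.finditer(data):
        out.append(decode_segment(data[pos:match.start()], charset))
        charset = ESCAPES[match.group()]
        pos = match.end()
    out.append(decode_segment(data[pos:], charset))  # tail after last escape
    return "".join(out)


sample = b"abc\x1b$BF|K\\8l\x1b(Bxyz"
print(decode_scan_ahead(sample))
```

Note that the JIS X 0208 octets in the sample (F|K\8l) are themselves printable ASCII, which is the other half of why option A cannot rely on a decode failure to find the switch point.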

Encode::Tcl does neither A nor B (but B would be better).

If the next byte is an "escape",
   then it invokes a new sub-encoding (CCS);
else
   it decodes the byte and converts it to Unicode.

The escape octets for Encode::Tcl::Escape are ESC, SI, and SO;
those for Encode::Tcl::Extend are SS2 ("\x8E") and SS3 ("\x8F");
that for Encode::Tcl::HanZi is '~', the tilde.

Any sequence of octets up to the next "escape" octet
could be handed to the sub-encoding for translation in one go.
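As an illustration of this octet-by-octet dispatch, here is a rough Python sketch for the tilde case (HZ). It is not the Encode::Tcl::HanZi code. It assumes the usual HZ rules: "~{" enters GB2312 mode, "~}" leaves it, "~~" is a literal tilde, and "~" followed by a newline is a line continuation. Python's gb2312 codec stands in for the sub-encoding table, and decode_hz is a made-up name.

```python
def decode_hz(data):
    """Look at each octet: switch mode on the escape, otherwise decode."""
    out = []
    gb_mode = False            # HZ text starts in ASCII mode
    i = 0
    while i < len(data):
        if data[i] == 0x7E:    # '~' is the escape octet
            nxt = data[i + 1:i + 2]
            if nxt == b"{":
                gb_mode = True         # invoke the GB2312 sub-encoding
            elif nxt == b"}":
                gb_mode = False        # back to ASCII
            elif nxt == b"~" and not gb_mode:
                out.append("~")        # escaped literal tilde
            elif nxt == b"\n" and not gb_mode:
                pass                   # line continuation, emits nothing
            else:
                raise ValueError("bad escape at offset %d" % i)
            i += 2
        elif gb_mode:
            # Two 7-bit octets; set the high bits to get EUC-CN (GB2312).
            pair = bytes(b | 0x80 for b in data[i:i + 2])
            out.append(pair.decode("gb2312"))
            i += 2
        else:
            out.append(bytes([data[i]]).decode("ascii"))
            i += 1
    return "".join(out)


print(decode_hz(b"~{VPND~} ok"))
```

This does one character per step, so there is no second scan, at the cost of testing every octet against the escape value.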

For encode there is a different pain. For each code point we need an
efficient way to find out whether a sub-encoding can represent that
point. A bit map of 0x10FFFF entries does not seem good, so it is
either an auxiliary table, or try-it-and-see (which should not be too bad
with a C version).

Encode::Tcl tries and sees.
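A try-it-and-see encoder might look roughly like the following Python sketch. Again this is not the Encode::Tcl code. It assumes the same two-entry ISO-2022-JP subset (ASCII and JIS X 0208), uses Python's euc_jp codec as the JIS X 0208 table, and all names (try_ascii, try_jisx0208, SUB_ENCODINGS, encode_try_and_see) are hypothetical.

```python
def try_ascii(ch):
    """Return the octets for ch in ASCII, or None if not representable."""
    return ch.encode("ascii") if ord(ch) < 0x80 else None


def try_jisx0208(ch):
    """Return the 7-bit JIS X 0208 octets for ch, or None."""
    try:
        raw = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return None
    # In EUC-JP only JIS X 0208 characters are two octets, both >= 0xA1.
    if len(raw) == 2 and raw[0] >= 0xA1:
        return bytes(b & 0x7F for b in raw)
    return None


# (escape sequence that selects it, function that tries it)
SUB_ENCODINGS = [(b"\x1b(B", try_ascii), (b"\x1b$B", try_jisx0208)]


def encode_try_and_see(text):
    out = bytearray()
    current = 0                                  # start in ASCII
    for ch in text:
        # Try the current sub-encoding first, then the others in order.
        order = [current] + [i for i in range(len(SUB_ENCODINGS))
                             if i != current]
        for i in order:
            esc, attempt = SUB_ENCODINGS[i]
            raw = attempt(ch)
            if raw is not None:
                if i != current:                 # switching: emit escape
                    out += esc
                    current = i
                out += raw
                break
        else:
            raise ValueError("no sub-encoding for U+%04X" % ord(ch))
    if current != 0:
        out += SUB_ENCODINGS[0][0]               # finish back in ASCII
    return bytes(out)


print(encode_try_and_see("abc\u65e5\u672c\u8a9exyz"))
```

Trying the current sub-encoding first keeps the common case (a run of characters from one set) to a single lookup per character, so the failed attempts only occur at the switch points.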

--
Nick Ing-Simmons
http://www.ni-s.u-net.com/

SADAHIRO Tomoyuki
