Re: iso-2022-jp problem

On Monday, April 15, 2002, at 07:29 , Nick Ing-Simmons wrote:

> I tracked down the "problem" tkmail was/is having with iso-2022-jp.
> The snag is I am using the API the way I designed it, not the way
> it is reliably implemented.
>
> When called thus:
>
> my $decoded = $enc->decode($encoded,1);
>
> decode is supposed to return portion it can decode, and set $encoded
> to what remains.
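
For reference, a minimal sketch of that contract using a single-byte encoding, where "what remains" is unambiguous. It assumes a later Encode release in which this behaviour is requested with Encode::FB_QUIET; the literal 1 in the call above belongs to the API as it stood at the time and should not be assumed to mean the same thing today.

```perl
use strict;
use warnings;
use Encode ();

my $enc     = Encode::find_encoding('ascii');
my $encoded = "abcd\xFFrest";    # \xFF is not valid ASCII

# Assumption: FB_QUIET is the modern spelling of "return what you could
# decode and leave the rest in the source buffer".
my $decoded = $enc->decode($encoded, Encode::FB_QUIET);

printf "decoded:   %s\n", $decoded;
printf "remaining: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $encoded;
```

Here the remainder simply starts at the offending byte, which is exactly what a stateful encoding such as ISO-2022-JP cannot promise.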

Ah, I see. But that is a pain in the arse for "doubly-encoded" encodings like ISO-2022-JP.

Here is the problem. As you see, to decode ISO-2022-JP, we first have to convert it into EUC-JP. ISO-2022-JP -> EUC-JP is treated (and should be treated) purely as a CES, so there is no chance of error (unless there is a bogus escape sequence). However, errors may arise when you try to convert the resulting EUC-JP stream to UTF-8.
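
A rough sketch of that two-step path, for illustration only. jis_to_euc is a hypothetical helper, not the code in Encode::JP; it handles only the ASCII / JIS-Roman and JIS X 0208 escapes and ignores JIS X 0212 and half-width kana.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: rewrite ISO-2022-JP bytes as EUC-JP bytes.
# This is the pure CES step: no character can fail to convert here.
sub jis_to_euc {
    my ($jis) = @_;
    my $euc  = '';
    my $mode = 'ascii';
    # Split on the escape sequences, keeping them in the list.
    for my $part (split /(\e\$[\@B]|\e\([BJ])/, $jis) {
        next if $part eq '';
        if ($part =~ /^\e\$/) { $mode = 'jis0208'; next }
        if ($part =~ /^\e\(/) { $mode = 'ascii';   next }
        # In a JIS X 0208 run, setting the high bit yields EUC-JP.
        $part =~ tr/\x21-\x7e/\xa1-\xfe/ if $mode eq 'jis0208';
        $euc .= $part;
    }
    return $euc;
}

my $jis = "abc\e\$B\x24\x22\e(Bdef";     # "abc", HIRAGANA LETTER A, "def"
my $euc = jis_to_euc($jis);              # step 1: CES only
my $str = decode('euc-jp', $euc);        # step 2: this is where mapping can fail
```

By the time step 2 reports a problem, the escape sequences are already gone, so the offset it reports refers to the EUC-JP stream and not to the original input.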

The problem is that not all of the possible code points in JIS X 0208 and JIS X 0212 are actually used. Each set has 94x94 = 8836 positions, of which only 6884 are used in 0208 and 6072 in 0212. The remainder won't map to Unicode.
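
Spelling out the arithmetic with the figures quoted above (the counts are taken from this message as-is, not re-derived from the standards):

```perl
use strict;
use warnings;

my $cells = 94 * 94;                              # positions per 94x94 set
my %used  = (jis0208 => 6884, jis0212 => 6072);   # counts as quoted above

# Positions that have no character assigned and so cannot map to Unicode.
my %unused = map { $_ => $cells - $used{$_} } keys %used;

for my $set (sort keys %used) {
    printf "%s: %d of %d used, %d unmapped\n",
        $set, $used{$set}, $cells, $unused{$set};
}
```

So roughly a fifth of the 0208 positions and close to a third of the 0212 positions are candidates for a decode error.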

It was possible to use jis02*-raw instead of EUC-JP, but that implementation was too slow because you have to invoke encode() chunk by chunk. In fact, I tried it and it was three times slower.

And the very notion of "what remains" becomes murky when it comes to ISO-2022. Suppose you have a string like this:


abcd<ESC-to-jis0208>cdefghijklmn<ESC-to-ascii>opqrstu....
                        ^^error occurs here.

What's the remaining stream?

ghijklmn<ESC-to-ascii>opqrstu....

is WRONG because we are now in a jis0208 chunk and the escape sequence has already been stripped. Do we have to return something like


<ESC-to-jis0208>ghijklmn<ESC-to-ascii>opqrstu....

instead? But that slows down the encoder too much. I just woke up. Let me think about this a little bit more....
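
To make the "remaining stream" question concrete, here is a small sketch of the re-prefixing idea. remainder_with_state is a hypothetical helper, not anything in Encode, and it assumes the caller already knows the byte offset of the failure in the original ISO-2022-JP input, which is itself the hard part.

```perl
use strict;
use warnings;

# Hypothetical helper: given ISO-2022-JP bytes and the byte offset where
# decoding failed, return a remainder that can be decoded on its own.
sub remainder_with_state {
    my ($jis, $offset) = @_;
    my $active = '';                       # '' means the initial ASCII state
    my $head   = substr($jis, 0, $offset);
    # Remember the last designation escape seen before the failure point.
    while ($head =~ /(\e\$[\@B]|\e\([BJ])/g) {
        $active = ($1 eq "\e(B") ? '' : $1;
    }
    # Re-emit that escape so the remainder starts in the right state.
    return $active . substr($jis, $offset);
}

# Same shape as the example above, with ESC $ B and ESC ( B as the escapes.
my $input = "abcd\e\$Bcdefghijklmn\e(Bopqrstu";
my $rest  = remainder_with_state($input, 11);     # offset of "gh"
```

Rescanning the head on every error, as done here, is the kind of extra cost in question; a real implementation would have to carry the state along instead.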


Dan the Encode Maintainer