perl-unicode

Re: iso-2022-jp problem

2002-04-15 06:16:01
On Monday, April 15, 2002, at 07:29 , Nick Ing-Simmons wrote:
> I tracked down the "problem" tkmail was/is having with iso-2022-jp.
> The snag is I am using the API the way I designed it, not the way
> it is reliably implemented.
>
> When called thus:
>
> my $decoded = $enc->decode($encoded,1);
>
> decode is supposed to return portion it can decode, and set $encoded
> to what remains.
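
As an illustration of that contract, here is a minimal sketch that uses UTF-8 rather than iso-2022-jp, so the result is easy to check. It assumes a current Encode release, where this "return what decoded, leave the rest in the source" behaviour is requested with Encode::FB_QUIET. Whether the literal 1 in the call above meant the same thing at the time is an assumption.

```perl
use strict;
use warnings;
use Encode ();

# Assumption: FB_QUIET stands in for the check value 1 used in the call above.
my $enc     = Encode::find_encoding('UTF-8');
my $encoded = "abc\x80def";    # \x80 is a stray continuation byte

my $decoded = $enc->decode($encoded, Encode::FB_QUIET());

# $decoded holds the part that could be decoded, and $encoded
# has been overwritten with the bytes from the error onwards.
printf "decoded=%s remaining_bytes=%d\n", $decoded, length $encoded;
```

The caller can then inspect or skip the offending bytes and call decode again on what is left.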

Ah, I see. But it is a pain in the arse for "doubly-encoded" encodings like ISO-2022-JP.

Here is the problem. To decode ISO-2022-JP, we first have to convert it to EUC-JP. The ISO-2022-JP -> EUC-JP step is treated (and should be treated) purely as a CES (character encoding scheme), so there is no chance of error there (unless there is a bogus escape sequence). However, errors may arise when you try to convert the resulting EUC-JP stream to UTF-8.
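
The two stages can be sketched roughly as follows. This is not Encode's actual implementation: jis_to_euc_sketch is a hypothetical helper that only knows the JIS X 0208 and ASCII/JIS-Roman escape sequences (it treats JIS-Roman as ASCII and ignores JIS X 0212 and half-width kana). It is only meant to show that stage 1 is a plain byte rewrite, while stage 2 is where a table lookup can fail.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: stage 1, ISO-2022-JP -> EUC-JP, a pure byte rewrite.
sub jis_to_euc_sketch {
    my ($jis) = @_;
    my $euc  = '';
    my $mode = 'ascii';
    # The capturing group makes split return the escape sequences too.
    for my $tok (split /(\e\$[\@B]|\e\([BJ])/, $jis) {
        next if $tok eq '';
        if    ($tok =~ /\A\e\$/) { $mode = 'jis0208'; next }
        elsif ($tok =~ /\A\e\(/) { $mode = 'ascii';   next }
        my $chunk = $tok;
        # In EUC-JP, JIS X 0208 bytes are the same bytes with the high bit set.
        $chunk =~ tr/\x21-\x7e/\xa1-\xfe/ if $mode eq 'jis0208';
        $euc .= $chunk;
    }
    return $euc;
}

my $jis  = "ab\e\$B\x24\x22\e(Bcd";    # "ab", HIRAGANA LETTER A, "cd"
my $euc  = jis_to_euc_sketch($jis);    # stage 1: cannot fail on valid escapes
my $text = decode('euc-jp', $euc);     # stage 2: table lookup, may hit unmapped cells
```

By the time stage 2 reports a problem, the escape sequences are gone, so the error position refers to the EUC-JP bytes and not to the original ISO-2022-JP stream.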

The problem is that not all of the possible code points in JIS X 0208 and JIS X 0212 are actually used. Each set has 94x94 = 8836 possible code points, of which only 6884 are used in 0208 and 6072 in 0212. So the remainder won't map to Unicode.
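
To put numbers on the gap, the arithmetic can be spelled out. The "used" counts are simply the figures quoted in this message, not values read from the mapping tables.

```perl
use strict;
use warnings;

my $cells = 94 * 94;    # rows x columns in one 94x94 set
my %used  = (jis0208 => 6884, jis0212 => 6072);    # figures quoted above

# Cells with no character assigned, and hence no Unicode mapping.
my %unused = map { $_ => $cells - $used{$_} } keys %used;

printf "%s: %d of %d cells unassigned\n", $_, $unused{$_}, $cells
    for sort keys %unused;
```

That leaves 1952 unassigned cells in 0208 and 2764 in 0212, any of which can turn up in a syntactically valid stream.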

It was possible to use jis02*-raw instead of EUC-JP, but that implementation was too slow because you have to invoke encode() chunk by chunk. In fact I tried it, and it was 3 times as slow.

And the meaning of "what remains" becomes debatable when it comes to ISO-2022. Suppose you got a string like this:

abcd<ESC-to-jis0208>cdefghijklmn<ESC-to-ascii>opqrstu....
                        ^^error occurs here.

What's the remaining stream?

ghijklmn<ESC-to-ascii>opqrstu....


is WRONG, because we are still inside the jis0208 chunk and the escape sequence has already been stripped. Do we have to return something like

<ESC-to-jis0208>ghijklmn<ESC-to-ascii>opqrstu....

instead? But that slows down the encoder too much. I just woke up. Let me think about this a little bit more....
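
For concreteness, the second option could look something like the sketch below. remainder_with_state is a hypothetical helper, not part of Encode. It assumes the caller knows the byte offset of the error in the original ISO-2022-JP stream, and it only recognises the JIS X 0208 and ASCII/JIS-Roman escape sequences.

```perl
use strict;
use warnings;

# Hypothetical helper: given the raw stream and the byte offset of the error,
# return the remainder with the active designation put back in front.
sub remainder_with_state {
    my ($stream, $offset) = @_;
    my $head = substr($stream, 0, $offset);
    my $tail = substr($stream, $offset);

    # Greedy .* makes the capture the LAST escape sequence before the error.
    my ($active) = $head =~ /.*(\e\$[\@B]|\e\([BJ])/s;

    # Nothing to restore if we never left ASCII or have switched back to it.
    return $tail if !defined $active || $active eq "\e(B";
    return $active . $tail;
}

# Same shape as the example above, with real escape sequences.
my $stream = "abcd\e\$Bcdefghijklmn\e(Bopqrstu";
my $rest   = remainder_with_state($stream, index($stream, 'gh'));
```

Rescanning everything before the error just to recover the state is the sort of extra work that makes this approach expensive.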

Dan the Encode Maintainer
