On Monday, April 15, 2002, at 07:29, Nick Ing-Simmons wrote:
> I tracked down the "problem" tkmail was/is having with iso-2022-jp.
> The snag is I am using the API the way I designed it, not the way
> it is reliably implemented.
> When called thus:
>   my $decoded = $enc->decode($encoded,1);
> decode is supposed to return the portion it can decode, and set $encoded
> to what remains.
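
For reference, here is a minimal sketch of that "return what decodes, leave the rest" contract on a simple single-byte encoding. It assumes a current Encode, where this behaviour is requested with the Encode::FB_QUIET check value rather than the literal 1 used above; the 'ascii' encoding and the sample bytes are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my $enc     = find_encoding('ascii');
my $encoded = "abc\xE9def";    # \xE9 is not a valid ASCII byte

# With FB_QUIET, decoding stops at the first error, the portion decoded
# so far is returned, and $encoded is overwritten with the unprocessed rest.
my $decoded = $enc->decode($encoded, Encode::FB_QUIET);

printf "decoded: %s\n", $decoded;
printf "remains: %s\n", unpack('H*', $encoded);    # hex dump of leftover bytes
```

For a stateless encoding like this the remainder is unambiguous, which is exactly what breaks down for ISO-2022-JP.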
Ah, I see. But it is a pain in the arse for "doubly-encoded" encodings
like ISO-2022-JP.
Here is the problem. As you see, to decode ISO-2022-JP, we first have
to convert it into EUC-JP. And ISO-2022-JP -> EUC-JP is treated (and
should be treated) purely as a CES, so there is no chance of error
(unless there is a bogus escape sequence). However, errors may arise
when you try to convert the resulting EUC-JP stream to UTF-8.
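
To make the first stage concrete, here is a rough pure-Perl sketch of ISO-2022-JP -> EUC-JP as a pure CES rewrite. It is not the code Encode actually uses; iso2022jp_to_eucjp is a hypothetical helper, and it only handles the escape sequences for ASCII, JIS X 0201 Roman, JIS X 0208 and JIS X 0212.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite ISO-2022-JP bytes as EUC-JP bytes.
# Only the byte-level scheme changes; no Unicode mapping is consulted.
sub iso2022jp_to_eucjp {
    my ($in) = @_;
    my $out  = '';
    my $mode = 'ascii';    # ISO-2022-JP starts in ASCII

    # Split on escape sequences, keeping them as separate chunks.
    for my $chunk (split /(\e\$\(D|\e\$[\@B]|\e\([BJ])/, $in) {
        next if $chunk eq '';
        if ($chunk eq "\e\$B" or $chunk eq "\e\$\@") { $mode = 'jis0208'; next }
        if ($chunk eq "\e\$(D")                      { $mode = 'jis0212'; next }
        if ($chunk eq "\e(B" or $chunk eq "\e(J")    { $mode = 'ascii';   next }

        if ($mode eq 'ascii') {
            $out .= $chunk;
            next;
        }
        # Double-byte sets: set the high bit on every byte.
        (my $c = $chunk) =~ tr/\x21-\x7e/\xa1-\xfe/;
        # JIS X 0212 characters additionally get the SS3 (0x8F) prefix.
        $c =~ s/(..)/\x8f$1/gs if $mode eq 'jis0212';
        $out .= $c;
    }
    return $out;
}

my $euc = iso2022jp_to_eucjp("ab\e\$B0!\e(Bcd");
print unpack('H*', $euc), "\n";
```

The second stage would then be an ordinary EUC-JP to Unicode decode of $euc, and that is the only place where an unmapped code point can show up.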
The problem is that not all of the possible code points in JIS X 0208
and JIS X 0212 are actually used. Each has 94x94 = 8836 cells, of which
only 6884 are used in 0208 and 6072 are used in 0212. So the remainder
won't map to Unicode.
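
Spelled out as a quick calculation, using only the counts quoted above (the hash names are just for illustration):

```perl
use strict;
use warnings;

my $cells = 94 * 94;    # cells in one 94x94 double-byte table
my %used  = (jisx0208 => 6884, jisx0212 => 6072);    # counts quoted above

# Cells with no assigned character, and hence no Unicode mapping.
my %unused = map { $_ => $cells - $used{$_} } keys %used;

printf "%s: %d of %d cells unused\n", $_, $unused{$_}, $cells
    for sort keys %unused;
```

So roughly a fifth of the 0208 table and nearly a third of the 0212 table can trigger a mapping error.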
It was possible to use jis02*-raw instead of EUC-JP, but that
implementation was too slow because you have to invoke encode() chunk by
chunk. In fact I tried it, and it was three times slower.
And the sense of "what remains" becomes moot when it comes to
ISO-2022. Suppose you have a string like this:
abcd<ESC-to-jis0208>cdefghijklmn<ESC-to-ascii>opqrstu....
^^error occurs here.
What's the remaining stream?
ghijklmn<ESC-to-ascii>opqrstu....
is WRONG because we are now in a jis0208 chunk and the escape sequence
has already been stripped. Do we have to go like
<ESC-to-jis0208>ghijklmn<ESC-to-ascii>opqrstu....
instead? That would slow down the encoder too much. I just woke up.
Let me think about this a little bit more....
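
For the record, the "put the escape sequence back" idea could be sketched like this. It is only an illustration of the bookkeeping, not a proposal for the real encoder; resumable_remainder is a hypothetical helper that takes the original byte string plus the offset where decoding stopped.

```perl
use strict;
use warnings;

# Hypothetical helper: return the unprocessed tail of an ISO-2022-JP
# string, prefixed with whatever escape sequence was in effect at $offset,
# so that the tail can be decoded on its own.
sub resumable_remainder {
    my ($octets, $offset) = @_;
    my $head   = substr($octets, 0, $offset);
    my $active = '';    # '' means the initial ASCII state

    # The last escape sequence seen before $offset decides the state.
    while ($head =~ /(\e\$\(D|\e\$[\@B]|\e\([BJ])/g) {
        $active = $1;
    }
    $active = '' if $active eq "\e(B";    # ASCII is the default anyway
    return $active . substr($octets, $offset);
}

my $stream = "abcd\e\$Bcdefghijklmn\e(Bopqrstu";
my $rest   = resumable_remainder($stream, 11);    # 11 = offset of "gh"
(my $shown = $rest) =~ s/\e/<ESC>/g;
print "$shown\n";
```

The extra cost is the rescan (or the state tracking that replaces it) on every error, which is the slowdown I am worried about.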
Dan the Encode Maintainer