On Monday, April 15, 2002, at 07:29, Nick Ing-Simmons wrote:
> I tracked down the "problem" tkmail was/is having with iso-2022-jp.
> The snag is I am using the API the way I designed it, not the way
> it is reliably implemented.
> When called thus:
>   my $decoded = $enc->decode($encoded,1);
> decode is supposed to return portion it can decode, and set $encoded
> to what remains.
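
For reference, here is a minimal sketch of that "return what you can, leave the rest" contract. It assumes a present-day Encode, where (as far as I know) this behaviour is requested with Encode::FB_QUIET rather than a literal 1, and it uses UTF-8 with a truncated character only because that is the simplest way to provoke a stop; it is not the tkmail case itself.

```perl
use strict;
use warnings;
use Encode ();

# "abc" followed by only the first two bytes of a three-byte UTF-8 character.
my $encoded = "abc\xE3\x81";

# With FB_QUIET, decode() returns the portion it could decode and
# overwrites $encoded with the unprocessed remainder.
my $decoded = Encode::decode('UTF-8', $encoded, Encode::FB_QUIET);

printf "decoded %d characters, %d bytes left over\n",
    length $decoded, length $encoded;
```

The leftover bytes can then be prepended to the next buffer and fed to decode() again.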
Ah, I see. But it is a pain in the arse for "doubly-encoded" encodings.
Here is the problem. As you see, to decode ISO-2022-JP, we first have
to convert it into EUC-JP. And ISO-2022-JP -> EUC-JP is treated (and
should be treated) purely as a CES, so there is no chance of error
(unless there is a bogus escape sequence). However, errors may arise
when you try to convert the resulting EUC-JP stream to UTF-8.
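
To make the two stages concrete, here is a rough sketch. The helper iso2022jp_to_eucjp is hypothetical (it is not the code in Encode::JP) and only handles ASCII/JIS-Roman and JIS X 0208; it just shows that stage 1 is a mechanical byte rewrite while stage 2 is the table lookup that can fail.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not the actual Encode::JP code: rewrite
# ISO-2022-JP bytes as EUC-JP bytes (ASCII/JIS-Roman and JIS X 0208 only).
sub iso2022jp_to_eucjp {
    my ($in) = @_;
    my ($out, $mode) = ('', 'ascii');
    # Split on escape sequences; the capture group keeps them in the list.
    for my $piece (split /(\e\$[\@B]|\e\([BJ])/, $in) {
        next if $piece eq '';
        if    ($piece =~ /\A\e\$[\@B]\z/) { $mode = 'jis0208'; next }
        elsif ($piece =~ /\A\e\([BJ]\z/)  { $mode = 'ascii';   next }
        # Inside a JIS X 0208 chunk, EUC-JP is the same bytes with bit 8 set.
        $piece =~ tr/\x21-\x7e/\xa1-\xfe/ if $mode eq 'jis0208';
        $out .= $piece;
    }
    return $out;
}

my $jis = "A\e\$B" . "\x24\x22" . "\e(B";   # "A", then U+3042 in JIS X 0208
my $euc = iso2022jp_to_eucjp($jis);         # stage 1: pure CES rewrite
my $str = Encode::decode('euc-jp', $euc);   # stage 2: table lookup, may fail
```

Note that by the time stage 2 runs, the escape sequences are gone, so an error there is reported against the EUC-JP bytes and not against the original ISO-2022-JP stream.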
The problem is that not all of the possible code points in JIS X 0208
and JIS X 0212 are actually used. Each set has 94 x 94 = 8836 possible
code points, of which only 6884 are used in 0208 and 6072 are used in
0212. So the remainder won't map to Unicode.
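
Spelled out as arithmetic, using only the counts quoted above (the variable names are mine):

```perl
use strict;
use warnings;

my $cells = 94 * 94;                                 # cells in one 94x94 set
my %used  = (jisx0208 => 6884, jisx0212 => 6072);    # counts quoted above
my %unused = map { $_ => $cells - $used{$_} } keys %used;

printf "%s: %d of %d cells have no mapping\n", $_, $unused{$_}, $cells
    for sort keys %unused;
```

That is 1952 unmapped cells in 0208 and 2764 in 0212, any of which can turn up in a syntactically valid stream.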
It was possible to use jis02*-raw instead of EUC-JP, but that
implementation was too slow because you have to invoke encode() chunk by
chunk. In fact I tried it and it was 3 times as slow.
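
For illustration, the chunk-by-chunk approach looks roughly like this on the decode side. decode_by_chunk is a hypothetical name, the sketch handles ASCII and JIS X 0208 only, and it is not the code I benchmarked; the point is simply that every run between escape sequences costs one call.

```perl
use strict;
use warnings;
use Encode ();
use Encode::JP;   # provides the jis0208-raw table

# Hypothetical chunk-by-chunk decoder (ASCII and JIS X 0208 only):
# one Encode::decode() call per chunk between escape sequences.
sub decode_by_chunk {
    my ($in) = @_;
    my ($out, $encoding, $calls) = ('', 'ascii', 0);
    for my $piece (split /(\e\$[\@B]|\e\([BJ])/, $in) {
        next if $piece eq '';
        if    ($piece =~ /\A\e\$[\@B]\z/) { $encoding = 'jis0208-raw'; next }
        elsif ($piece =~ /\A\e\([BJ]\z/)  { $encoding = 'ascii';       next }
        $out .= Encode::decode($encoding, $piece);
        $calls++;
    }
    return ($out, $calls);
}

my $jis = "A\e\$B" . "\x24\x22" . "\e(B" . "Z";
my ($str, $calls) = decode_by_chunk($jis);   # three chunks, three calls
```

Text that switches between ASCII and kanji every few characters multiplies the number of calls, which is where the time goes.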
And the very notion of "what remains" gets moot when it comes to
ISO-2022. Suppose you got a string like this:
^^error occurs here.
What's the remaining stream?
is WRONG because we are now in a jis0208 chunk and the escape sequence
is already stripped. Do we have to go like
but that slows down the encoder too much. I just woke up. Let me
think about this a little bit more....
Dan the Encode Maintainer