perl-unicode

Re: iso-2022-jp problem

2002-04-15 08:02:11
Dan Kogai <dankogai(_at_)dan(_dot_)co(_dot_)jp> writes:
On Monday, April 15, 2002, at 07:29 , Nick Ing-Simmons wrote:
I tracked down the "problem" tkmail was/is having with iso-2022-jp.
The snag is I am using the API the way I designed it, not the way
it is reliably implemented.

When called thus:

my $decoded = $enc->decode($encoded,1);

decode is supposed to return the portion it can decode, and set $encoded
to what remains.
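
For illustration, a minimal sketch of that contract using the Encode
module as it is documented today. It assumes Encode::FB_QUIET is the
present-day counterpart of the true $chk above, and it uses UTF-8
rather than ISO-2022-JP so that the cut falls inside a single
character. The names $first, $second and $pending are made up for the
example.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my $enc = find_encoding('UTF-8');

# "e acute" is the two octets C3 A9 in UTF-8; cut the buffer between them.
my $encoded = "caf\xC3";

# Decode what is complete; decode() rewrites $encoded in place so that it
# holds only the octets it could not use yet.
my $first   = $enc->decode($encoded, Encode::FB_QUIET);
my $pending = $encoded;    # copy of what was left behind, for inspection

# The caller appends the next chunk to the leftover and calls again.
$encoded .= "\xA9!";
my $second = $enc->decode($encoded, Encode::FB_QUIET);
```

The point of the design is that the caller never has to know where
character boundaries are; it only appends new octets to whatever
decode() left in the buffer.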

Ah, I see.  But it is a pain in the arse for "doubly-encoded" encodings
like ISO-2022-JP.

Here is the problem.  As you see, to decode ISO-2022-JP, we first have
to decode it into EUC-JP.  And ISO-2022-JP -> EUC-JP is treated (and
should be treated) purely as a CES so there is no chance for error
(unless there is a bogus escape sequence).  However, errors may arise
when you try to convert the resulting EUC-JP stream to UTF-8.
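
The two stages can be sketched as follows. This is not the code in
Encode::JP, only an illustration under stated assumptions: jis_to_euc
is a hypothetical helper, it handles only ESC ( B (ASCII) and ESC $ B
(JIS X 0208), and it expects a complete string. The first stage just
sets the high bit of each JIS X 0208 byte, which is why it cannot fail
on well-formed input; only the second stage consults a mapping table.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: rewrite a complete ISO-2022-JP string as EUC-JP.
sub jis_to_euc {
    my ($jis) = @_;
    my $euc      = '';
    my $two_byte = 0;    # a stream starts in ASCII
    # Split on the escape sequences but keep them, so we can track the mode.
    for my $part (split /(\e\(B|\e\$B)/, $jis) {
        if    ($part eq "\e(B")  { $two_byte = 0 }
        elsif ($part eq "\e\$B") { $two_byte = 1 }
        elsif ($two_byte) {
            # JIS X 0208 bytes 0x21-0x7E become 0xA1-0xFE in EUC-JP.
            $part =~ tr/\x21-\x7e/\xa1-\xfe/;
            $euc .= $part;
        }
        else { $euc .= $part }
    }
    return $euc;
}

# Stage 1: pure re-encoding. JIS 0x2422 is HIRAGANA LETTER A.
my $euc = jis_to_euc("abc\e\$B\x24\x22\e(Bdef");

# Stage 2: table lookup, the only place where "does not map" can happen.
my $text = decode('euc-jp', $euc);
```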

The problem is that not all of the possible code points in JIS X 0208
and JIS X 0212 are actually used.  Each has 94x94 = 8836 possible code
points, of which only 6884 are used in 0208 and 6072 are used in 0212.
So the remainder won't map to Unicode.

It was possible to use jis02*-raw instead of EUC-JP, but that
implementation was too slow because you have to invoke encode() chunk by
chunk.  In fact I tried it and it was 3 times as slow.

And the sense of "what remains" gets moot when it comes to
ISO-2022.  Suppose you got a string like this:

abcd<ESC-to-jis0208>cdefghijklmn<ESC-to-ascii>opqrstu....
                        ^^error occurs here.

What's the remaining stream?

ghijklmn<ESC-to-ascii>opqrstu....

That does not matter for this case.
"does not map" is a fatal error with $chk true (and would have
become a replacement char if $chk were false).

What matters is being able to tell the complete case from the partial case.

 A. When you have converted the whole thing, set remains to ''.
 B. When you have a partial encoding, consume as much as you can
    and leave "string" with what is partial.

e.g.

abcd<ESC-to-jis0208>cdefghijklmn<ESC-to  -ascii>opqrstu....
                                       ^- buffer boundary

Then you return the translation of
"abcd<ESC-to-jis0208>cdefghijklmn"
and set "remains" to "<ESC-to"
so that :encoding can append "-ascii>opqrstu...." to it.
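
A rough sketch of finding that split point, to make cases A and B
concrete. Everything here is an assumption made for illustration:
split_complete is a hypothetical helper, not part of Encode; it knows
only the four escape sequences of plain ISO-2022-JP; and it also holds
back half of a two-byte character, which the example above does not
mention but which raises the same problem.

```perl
use strict;
use warnings;

# Character width selected by each escape sequence (plain ISO-2022-JP only).
my %WIDTH = (
    "\e(B"   => 1,    # ASCII
    "\e(J"   => 1,    # JIS X 0201 Roman
    "\e\$\@" => 2,    # JIS C 6226-1978
    "\e\$B"  => 2,    # JIS X 0208-1983
);

# Hypothetical helper. Returns (complete, remains, width): "complete" ends
# on a character boundary, "remains" is whatever was cut by the buffer
# boundary, and "width" is the character width in effect at the split.
sub split_complete {
    my ($buf, $width) = @_;
    $width = 1 unless defined $width;    # a stream starts in ASCII
    my ($pos, $len) = (0, length $buf);
    while ($pos < $len) {
        if (substr($buf, $pos, 1) eq "\e") {
            my $esc = substr($buf, $pos, 3);
            last if length($esc) < 3;    # escape sequence cut by the boundary
            die "bogus escape sequence at offset $pos\n"
                unless exists $WIDTH{$esc};
            $width = $WIDTH{$esc};
            $pos += 3;
        }
        else {
            last if $pos + $width > $len;    # two-byte character cut in half
            $pos += $width;
        }
    }
    return (substr($buf, 0, $pos), substr($buf, $pos), $width);
}

# Case B: the boundary falls inside <ESC-to-ascii>.
my ($done, $remains, $width) = split_complete("abcd\e\$Bcdefghijklmn\e(");

# The caller appends the next chunk to "remains" and tries again (case A).
my ($done2, $remains2) = split_complete($remains . "Bopqrstu", $width);
```

The caller has to carry the width (that is, the shift state) from one
call to the next, since the second chunk no longer contains the escape
sequence that selected it.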

If you cannot do that, then don't return or consume anything,
so :encoding can keep appending till you have the whole file, but that
is going to be very memory hungry.

-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/


