perl-unicode

Re: iso-2022-jp encoding on EBCDIC

2005-12-20 07:12:39
On Mon, 19 Dec 2005 22:28:55 -0800 (PST), rajarshi das 
<dazio_r(_at_)yahoo(_dot_)com> wrote

I am testing this with iso-2022-jp encoding :
------------------------
use encoding 'iso-2022-jp';

$a = "^[$B$!^[(B";
print "a : $a\n";
------------------------

On linux, I get :
a : ^[^[(B 
/* Why is the '(B' shown? Isn't this just an escape
sequence to switch over to ASCII? */ 

In a double-quoted string, $B and $! are interpolated
as variables;

that is, $a = '^[' . $B . $! . '^[(B'; in other words,
a concatenation of the literal ^[, the variable $B, the variable $!,
and the literal ^[(B.

And ^[ here is CIRCUMFLEX ACCENT + LEFT SQUARE BRACKET, two graphic
characters, not the control character ESCAPE.
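
As a minimal sketch of the two points above (the variable names are
only illustrative, not taken from the original script), the following
compares caret notation with a real ESCAPE and shows two ways to keep
$B and $! from being interpolated:

```perl
use strict;
use warnings;

# Caret notation typed as two printable characters: '^' followed by '['.
my $caret = '^[';

# A real ESCAPE control character, written with the "\e" escape.
my $esc = "\e";

printf "caret notation: %d character(s)\n", length $caret;
printf "real ESCAPE:    %d character(s)\n", length $esc;

# Single quotes suppress interpolation, so $B and $! stay literal text.
my $literal = '$B$!';

# In double quotes, a backslash before each dollar sign has the same effect.
my $escaped = "\$B\$!";

printf "literal text:   %s (%d characters)\n", $literal, length $literal;
print  "both spellings give the same string\n" if $literal eq $escaped;
```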

On ebcdic, I get : 
Malformed UTF-8 character (unexpected end of string)
at /u/isldev2/tmp_dbg/perl-5.8.7/lib/utf8_heavy.pl
line 330.
Malformed UTF-8 character (unexpected continuation
byte 0x6a, with no preceding start byte) in pattern
match (m//) at
/u/isldev2/tmp_dbg/perl-5.8.7/lib/utf8_heavy.pl line
337.
Malformed UTF-8 character (unexpected continuation
byte 0x6a, with no preceding start byte) in pattern
match (m//) at
/u/isldev2/tmp_dbg/perl-5.8.7/lib/utf8_heavy.pl line
337.

-- and some junk data.

Seems like in "$B$!^[(B" above, $! and ^[ are
incorrect two-byte sequences on EBCDIC. However, $!
does not translate into printable characters on cp-1047.
What do we replace them with? 

According to JIS X 0208:1997 Appendix 2 (which specifies ISO-2022-JP),
the escape sequences for ISO-2022-JP are "\x1B\x28\x42", "\x1B\x28\x4A",
"\x1B\x24\x40", and "\x1B\x24\x42".

ASCII graphic representations such as "\e$B" are not portable
to EBCDIC, even though they are widely used in the ASCII world.
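
One way to avoid depending on graphic characters is to spell out every
byte with \x escapes. The sketch below assumes an ASCII platform with
the Encode module installed; I have not run it on EBCDIC. The payload
bytes \x21\x21 (JIS X 0208 row 1, cell 1) are my own choice for
illustration and are not from the original script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# ISO-2022-JP is defined in terms of byte values, so write the bytes
# explicitly instead of typing the characters ESC, '$', 'B', and so on.
my $bytes = "\x1B\x24\x42"     # escape sequence: switch to JIS X 0208-1983
          . "\x21\x21"         # row 1, cell 1: IDEOGRAPHIC SPACE
          . "\x1B\x28\x42";    # escape sequence: switch back to ASCII

# Turn the byte string into a Perl character string.
my $text = decode('iso-2022-jp', $bytes);

printf "decoded %d character(s), first is U+%04X\n",
    length $text, ord $text;
```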

In EBCDIC, ESCAPE "\e" is not \x1B but \x27, DOLLAR SIGN $ is not \x24
but \x5B, and CAPITAL B is not \x42 but \xC2.  Don't replace the bytes
of an escape sequence with the graphic characters they correspond to
in ASCII.
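
The difference can be seen even on an ASCII machine. This is only a
sketch: it uses Encode's cp1047 table as a stand-in for a real EBCDIC
platform and prints the bytes that the characters ESCAPE, $, and B map
to under each encoding.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The three characters ESCAPE, DOLLAR SIGN, CAPITAL B as a character string.
my $chars = "\e\$B";

# The same characters encoded as ASCII and as EBCDIC (code page 1047).
my $ascii_hex  = unpack 'H*', encode('ascii',  $chars);
my $ebcdic_hex = unpack 'H*', encode('cp1047', $chars);

print "ascii : $ascii_hex\n";
print "cp1047: $ebcdic_hex\n";

# Only the ASCII bytes form the ISO-2022-JP escape sequence 1B 24 42.
print "cp1047 bytes are not an ISO-2022-JP escape sequence\n"
    if $ebcdic_hex ne '1b2442';
```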

If I understand it correctly, an escape sequence is a sequence of
7-bit or 8-bit combinations, but not a sequence of graphic characters;
an escape sequence is encoded neither in ASCII nor in EBCDIC.
(Though I am referring to JIS X 0202, the standard Japanese translation,
 rather than the original ISO/IEC 2022.)

I tested again with  : 
---------------------------------
use encoding 'iso-2022-jp';
# && is \x50\x50 on EBCDIC, which is valid according to jis0208.ucm
$a = "$B&&(B";
print "a : $a\n";
----------------------------------

But I still get the messages as above and some junk
data in $a, which I don't think is the correct output.

As Encode.pm is a CPAN module, perhaps bugs in it should be
reported to the maintainer of the module, rather than
the perl5-porters mailing list.

The site rt.cpan.org helps to report bugs in every distribution
released through CPAN:

    http://rt.cpan.org/NoAuth/Bugs.html?Dist=Encode

Regards,
SADAHIRO Tomoyuki

