perl-unicode

Re: Encode test problems in EBCDIC

2002-02-21 21:14:00
jhi,

On 2002.02.22, at 07:54, Jarkko Hietaniemi wrote:
Hi,

the new JP test and the old Tcl test are doing "somewhat okay" in EBCDIC
(I'm using an OS/390 mainframe).

  I wish I had access to it...

Failed Test            Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
...
../ext/Encode/t/JP.t    255 65280    22   16  72.73%  7-22
../ext/Encode/t/Tcl.t   137 35072   632   34   5.38%  592-598 600 602 604 606
                                                      608 610 612-632

My problem is what to do about these failures.  Especially the Tcl.t
is rather frustratingly close to success.  The JP.t might be a hard
nut to crack.  Should I just skip the failing tests?  If so, we need
to figure out what is the pattern of the failures (hardcoding by test
numbers would feel really evil...)?  We might entertain the idea of
completely skipping these tests, but the relatively high success rate
seems to be saying that fixing this instead of ignoring this might be
possible.


I have yet to grok your tests to the fullest extent, but this much I can tell: don't let the high success rate fool you. Remember that the 8-bit part is much smaller than the 16-bit part. If your tests attempt something like "feed a UTF-EBCDIC string to a given encoding, decode it back and see if it matches the original", chances are that most of the iso-8859-1 part is failing. But once again, I have yet to check in full detail.
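
To make the round-trip idea and the size difference concrete, here is a minimal sketch. It is not what JP.t or Tcl.t actually do; it uses Python's bundled codecs as a stand-in for Encode, cp037 as a stand-in EBCDIC code page (an assumption, the OS/390 box may well use another one), and round_trips is a hypothetical helper name.

```python
def round_trips(text, encoding):
    """Encode text, decode it back, and report whether it survived intact."""
    return text.encode(encoding).decode(encoding) == text

# The whole 8-bit repertoire: U+0000 through U+00FF.
eight_bit = "".join(chr(c) for c in range(0x100))

ok_latin1 = round_trips(eight_bit, "latin-1")
ok_ebcdic = round_trips(eight_bit, "cp037")   # cp037: stand-in EBCDIC code page

# Size of the 8-bit part versus one 94x94 two-byte grid.
eight_bit_cells = 0x100
two_byte_cells = 94 * 94
print(ok_latin1, ok_ebcdic, eight_bit_cells, two_byte_cells)
```

The point of the last two numbers: a test file dominated by two-byte characters can report a high pass rate even if every 8-bit case in it fails.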

  Dan, in case EBCDIC scares you (and it should :-), a quick intro:
  basically, consider the whole low 256 characters being rearranged from
  what they are in ASCII.  For example, ord("A") is 0xC1, not 0x41. (The
  pod/perlebcdic.pod has the full tables.)
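
A quick way to see that rearrangement without a mainframe, as a rough sketch: Python's bundled cp037 codec stands in for the EBCDIC code page here (an assumption; the code page on the OS/390 machine may differ in a few positions), and is_contiguous is a hypothetical helper.

```python
# Compare the byte value of "A" in ASCII and in one EBCDIC code page.
a_ascii = "A".encode("ascii")
a_ebcdic = "A".encode("cp037")      # cp037: stand-in EBCDIC code page
print(hex(a_ascii[0]), hex(a_ebcdic[0]))   # 0x41 0xc1

def is_contiguous(encoding):
    """True if A..Z occupy 26 consecutive byte values in this encoding."""
    codes = [ch.encode(encoding)[0] for ch in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"]
    return codes == list(range(codes[0], codes[0] + 26))

contiguous_ascii = is_contiguous("ascii")
contiguous_ebcdic = is_contiguous("cp037")  # gaps after I and after R
print(contiguous_ascii, contiguous_ebcdic)
```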

It sure does scare me. I have to confess that UTF-EBCDIC was totally off my mind. But here I got a hint: like perl used to be, CJK encodings are very, very ASCII-chauvinistic. Their variable-length encoding relies heavily on the fact that ASCII leaves the MSB of the byte alone. That way you can tell whether a given byte is a whole (half-width) character or half of a full-width character. The shadow of ASCII falls even on ISO-2022, an escape-based encoding that is not supposed to be affected by the MSB and such (only \e was supposed to matter): in ISO-2022, most 2-byte characters are laid out on either a 96x96 or a 94x94 grid, that is, (7-bit ASCII minus control characters) or (that minus space (\x20) and DEL (\x7F)).
  Obviously this will not work on EBCDIC....
  This one may be tougher than we think....
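
The MSB dependence can be sketched as follows. This is an illustration only, using Python's bundled euc_jp, iso2022_jp and cp037 codecs in place of Encode; split_euc_jp is a hypothetical helper, and cp037 is an assumed stand-in for whatever EBCDIC code page the mainframe uses.

```python
SS2, SS3 = 0x8E, 0x8F   # EUC-JP single-shift bytes

def split_euc_jp(data):
    """Split EUC-JP bytes into characters by inspecting the lead byte only."""
    chars, i = [], 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            width = 1       # MSB clear: a whole half-width (ASCII) character
        elif lead == SS3:
            width = 3       # SS3 + two bytes (JIS X 0212)
        else:
            width = 2       # SS2 + kana byte, or a two-byte JIS X 0208 character
        chars.append(data[i:i + width])
        i += width
    return chars

euc = "A\u3042".encode("euc_jp")    # "A" followed by HIRAGANA LETTER A
pieces = split_euc_jp(euc)
print(pieces)

# ISO-2022-JP: ESC $ B, then two bytes from the 94x94 grid, then ESC ( B.
jis = "\u3042".encode("iso2022_jp")
payload = jis[3:5]
in_94_grid = all(0x21 <= b <= 0x7E for b in payload)

# In EBCDIC a plain "A" already has the MSB set, so the test above misfires.
msb_set_for_plain_a = bool("A".encode("cp037")[0] & 0x80)
print(payload, in_94_grid, msb_set_for_plain_a)
```

With ASCII underneath, the high bit cleanly separates one-byte from multi-byte characters. With EBCDIC underneath, ordinary letters land in the range the splitter treats as lead bytes, and the ISO-2022 grid no longer lines up with the printable characters either.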
FYI, I know that something called 12-bit EBCDIC kanji also exists. I only know of its existence, but is it on our list of supported encodings?

The test logs are attached: I would really appreciate if you could see
some pattern in the failures.

I will do the best I can, but I will be away this weekend and won't be back online until Sunday at the earliest.

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen

Dan the Unstable according to Jack Cohen
