Re: Encode test problems in EBCDIC

jhi,

On 2002.02.22, at 07:54, Jarkko Hietaniemi wrote:

Hi,

the new JP test and the old Tcl test are doing "somewhat okay" in EBCDIC
(I'm using an OS/390 mainframe).


  I wish I had access to it...

Failed Test            Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
...
../ext/Encode/t/JP.t    255 65280    22   16  72.73%  7-22
../ext/Encode/t/Tcl.t   137 35072   632   34   5.38%  592-598 600 602 604 606
                                                      608 610 612-632

My problem is what to do about these failures.  Especially the Tcl.t
is rather frustratingly close to success.  The JP.t might be a hard
nut to crack.  Should I just skip the failing tests?  If so, we need
to figure out what is the pattern of the failures (hardcoding by test
numbers would feel really evil...)?  We might entertain the idea of
completely skipping these tests, but the relatively high success rate
seems to be saying that fixing this instead of ignoring this might be
possible.
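
One way to look for a pattern is to expand the "List of Failed" column
for Tcl.t into individual test numbers. The sketch below is illustrative
Python, not part of the test suite. `expand_failed` is a hypothetical
helper, and reading the run-together numbers in the report as
600 602 604 606 608 610 is an assumption (it does make the count come
out to the reported 34).

```python
def expand_failed(spec):
    """Expand a Test::Harness style list such as '7-9 12' into [7, 8, 9, 12]."""
    numbers = []
    for part in spec.split():
        first, _, last = part.partition("-")
        numbers.extend(range(int(first), int(last or first) + 1))
    return numbers

# Tcl.t failures as read from the report (see the assumption above).
tcl_failed = expand_failed("592-598 600 602 604 606 608 610 612-632")

# Tests that still pass inside the failing tail of the 632 tests.
tail_passed = [n for n in range(min(tcl_failed), 633) if n not in tcl_failed]

print(len(tcl_failed), "failed, first at", min(tcl_failed))
print("passing inside the tail:", tail_passed)
```

Under that reading every failure sits in the last 41 tests, and only the
odd-numbered tests 599 through 611 pass there, which hints that the
failures belong to one block of the test file rather than being
scattered.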

I have yet to grok your test to the fullest extent, but this much I can
tell: don't let the high success rate fool you. Remember that the 8-bit
part is much smaller than the 16-bit part. If your tests attempt
something like "feed UTF-EBCDIC to a given encoding, decode it back and
see if it matches the original", chances are that MOST of the iso-8859-1
part is failing. But once again, I have yet to check in full detail.
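
The kind of round-trip check guessed at here could look roughly like the
following. This is a sketch in Python for illustration only, not the
actual code of Tcl.t or JP.t. `round_trip_failures` is a hypothetical
helper, and cp037 is an assumed stand-in for an EBCDIC code page. Run on
an ASCII platform it only shows the shape of the check; it does not
reproduce the OS/390 failures.

```python
def round_trip_failures(encoding, chars):
    """Return the characters that do not survive encode-then-decode."""
    failures = []
    for ch in chars:
        try:
            if ch.encode(encoding).decode(encoding) != ch:
                failures.append(ch)
        except UnicodeError:
            # The character cannot be represented in this encoding at all.
            failures.append(ch)
    return failures

# The "8-bit part": the low 256 code points.
low_256 = [chr(cp) for cp in range(256)]

for name in ("iso-8859-1", "ascii", "cp037"):
    bad = round_trip_failures(name, low_256)
    print(f"{name}: {len(bad)} of {len(low_256)} characters fail")
```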

  Dan, in case EBCDIC scares you (and it should :-), a quick intro:
  basically, consider the whole low 256 characters being rearranged from
  what they are in ASCII.  For example, ord("A") is 0xC1, not 0x41. (The
  pod/perlebcdic.pod has the full tables.)
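
  The rearrangement can be seen without a mainframe. The sketch below
  uses Python's built-in cp037 codec purely as an illustration. Which
  EBCDIC code page a given OS/390 perl uses is not stated in this
  thread, so cp037 is an assumption, and `ebcdic_ord` is a hypothetical
  helper.

```python
def ebcdic_ord(ch):
    """Byte value of a character in code page 037 (an assumed stand-in)."""
    return ch.encode("cp037")[0]

for ch in ("A", "a", "0", " "):
    print(f"{ch!r}: ASCII 0x{ord(ch):02X}  cp037 0x{ebcdic_ord(ch):02X}")

# The alphabet is not contiguous in EBCDIC: there is a gap between I and J.
print("J - I in ASCII:", ord("J") - ord("I"))
print("J - I in cp037:", ebcdic_ord("J") - ebcdic_ord("I"))
```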

It sure does scare me. I have to confess UTF-EBCDIC was totally out of
my mind. But here I got a hint: like what perl used to be, CJK encodings
are very, very ASCII-chauvinistic. Their variable-length encoding
relies heavily on the fact that ASCII leaves the MSB of the byte alone.
That way you can tell whether a given byte is a whole (half-width)
character or half of a full-width character.

The shadow of ASCII falls even on ISO-2022, an escape-based encoding
that is not supposed to be affected by the MSB and such (only \e was
supposed to matter): in ISO-2022, most 2-byte characters are
represented on either a 96x96 or a 94x94 grid, where 96 is (7-bit ASCII
minus control characters) and 94 is (that minus space (0x20) and DEL
(\x7F)).
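
The grid arithmetic can be spelled out in a few lines. This is a sketch
in Python for illustration; `GRID_96` and `GRID_94` are names made up
here, and the sample string is arbitrary. It also checks that the
two-byte payload of an ISO-2022-JP string stays inside the 94-character
range, i.e. inside what ASCII treats as printable.

```python
# 7-bit codes minus the 32 control codes below 0x20.
GRID_96 = range(0x20, 0x80)   # still includes space (0x20) and DEL (0x7F)
GRID_94 = range(0x21, 0x7F)   # space and DEL removed as well

print(len(GRID_96), len(GRID_94), len(GRID_94) ** 2)

# ESC $ B switches to the two-byte set, ESC ( B switches back to ASCII.
encoded = "日本語".encode("iso2022_jp")
payload = encoded[3:-3]       # strip the two 3-byte escape sequences

print(encoded)
print(all(b in GRID_94 for b in payload))
```

A 94x94 grid gives 8836 code positions, each byte of which is a
printable ASCII code, so the layout only makes sense where 0x21 to 0x7E
mean what they mean in ASCII.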

  Obviously this will not work on EBCDIC....
  This one may be tougher than we think....

FYI I know something called 12-bit EBCDIC kanji also exists. I know
only of its existence, but is it on our support list?

The test logs are attached: I would really appreciate if you could see
some pattern in the failures.

I will do the best I can, but I will be away this weekend and I won't be
back online till Sunday at the earliest.

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen


Dan the Unstable according to Jack Cohen