jhi,
On 2002.02.22, at 07:54, Jarkko Hietaniemi wrote:
Hi,
the new JP test and the old Tcl test are doing "somewhat okay" in EBCDIC
(I'm using an OS/390 mainframe).
I wish I had access to it...
Failed Test            Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
...
../ext/Encode/t/JP.t    255 65280    22   16  72.73%  7-22
../ext/Encode/t/Tcl.t   137 35072   632   34   5.38%  592-598 600 602 604 606
                                                      608 610 612-632
My problem is what to do about these failures. Especially the Tcl.t
is rather frustratingly close to success. The JP.t might be a hard
nut to crack. Should I just skip the failing tests? If so, we need
to figure out what the pattern of the failures is (hardcoding by test
numbers would feel really evil...). We might entertain the idea of
completely skipping these tests, but the relatively high success rate
seems to be saying that fixing this instead of ignoring this might be
possible.
I have yet to grok your test fully, but this much I can tell: don't
let the high success rate fool you. Remember that the 8-bit part is
much smaller than the 16-bit part. If your test attempts something
like "feed UTF-EBCDIC to a given encoding, decode it back and see if
it matches the original", chances are that MOST of the iso-8859-1 part
is failing. But once again, I have yet to check in full detail.
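To make the "encode, decode back, compare" idea concrete, here is a
minimal sketch in Python (not the actual Encode test, and it does not
involve UTF-EBCDIC at all). The function name roundtrip_failures is
made up for illustration; it only reports which code points fail to
survive a round trip through a given encoding.

```python
def roundtrip_failures(encoding, code_points):
    """Return the code points that do not survive encode -> decode."""
    failures = []
    for cp in code_points:
        ch = chr(cp)
        try:
            ok = ch.encode(encoding).decode(encoding) == ch
        except UnicodeError:
            # The encoding cannot represent this character at all.
            ok = False
        if not ok:
            failures.append(cp)
    return failures


# The whole low 256 range survives ISO-8859-1 but not plain ASCII.
print(len(roundtrip_failures("iso8859_1", range(256))))
print(len(roundtrip_failures("ascii", range(256))))
```

Listing the failing code points this way, rather than the failing test
numbers, may make a pattern easier to spot.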
Dan, in case EBCDIC scares you (and it should :-), a quick intro:
basically, consider the whole low 256 characters being rearranged from
what they are in ASCII. For example, ord("A") is 0xC1, not 0x41. (The
pod/perlebcdic.pod has the full tables.)
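A quick way to see the rearrangement without a mainframe is to ask any
language that ships an EBCDIC codec. The sketch below uses Python and
its cp037 codec; that code page is an assumption for illustration and
may not be the one the OS/390 box uses. The helper byte_of is made up.

```python
def byte_of(ch, encoding):
    """Return the single byte value of ch in the given 8-bit encoding."""
    encoded = ch.encode(encoding)
    if len(encoded) != 1:
        raise ValueError("not a single-byte character in " + encoding)
    return encoded[0]


# Print the ASCII and EBCDIC (cp037, assumed) values side by side.
for ch in ("A", "a", "0", " "):
    print(repr(ch), hex(byte_of(ch, "ascii")), hex(byte_of(ch, "cp037")))
```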
Sure it does scare me. I have to confess UTF-EBCDIC was totally out
of my mind. But here I got a hint: like what perl used to be, CJK
encodings are very, very ASCII-chauvinistic. Their variable-length
encoding relies heavily on the fact that ASCII leaves the MSB of the
byte alone. That way you can tell whether a given byte is a whole
(half-width) character or half of a full-width character.
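Here is a rough sketch of that MSB trick, in Python, for EUC-JP. It is
a simplification: it treats 0x8F as the lead byte of a 3-byte sequence,
every other byte with the MSB set as the lead byte of a 2-byte
sequence, and does no validation. The name split_euc_jp is made up,
and cp037 stands in for "some EBCDIC code page".

```python
def split_euc_jp(data):
    """Split EUC-JP bytes into per-character chunks using only the MSB."""
    chunks = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            width = 1      # MSB clear: a whole single-byte (ASCII) character
        elif lead == 0x8F:
            width = 3      # SS3 prefix: 3-byte JIS X 0212 character
        else:
            width = 2      # MSB set: first half of a 2-byte character
        chunks.append(data[i:i + width])
        i += width
    return chunks


# "Perl" followed by two kanji (U+6F22, U+5B57).
sample = "Perl\u6f22\u5b57".encode("euc_jp")
print(split_euc_jp(sample))

# EBCDIC letters have the MSB set, so the same rule misreads them as
# halves of 2-byte characters.
print(split_euc_jp("Perl".encode("cp037")))
```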
The shadow of ASCII falls even on ISO-2022, an escape-based encoding
that is not supposed to be affected by the MSB and such (only \e was
supposed to matter). In ISO-2022, most 2-byte characters are
represented on either a 96x96 or a 94x94 grid, where 96 is (7-bit
ASCII - control characters) and 94 is (that - space (0x20) and DEL
(\x7F)). Obviously this will not work on EBCDIC....
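The 94x94 arithmetic can be sketched as follows, again in Python. A
(row, cell) pair, each 1 to 94, maps to two bytes in 0x21..0x7E, which
is exactly the graphic ASCII range. The helper names are made up; the
last lines check the mapping against Python's own iso2022_jp codec,
assuming it emits a 3-byte escape sequence before the 2-byte character.

```python
FIRST, LAST = 0x21, 0x7E  # ASCII graphic range without space and DEL


def kuten_to_bytes(row, cell):
    """Map a 94x94 grid position (1-based) to its two 7-bit bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([0x20 + row, 0x20 + cell])


def bytes_to_kuten(pair):
    """Inverse of kuten_to_bytes."""
    return pair[0] - 0x20, pair[1] - 0x20


print(LAST - FIRST + 1)  # number of usable byte values per axis

# One kanji (U+6F22): escape sequence, two grid bytes, escape back to ASCII.
encoded = "\u6f22".encode("iso2022_jp")
print(encoded)
print(bytes_to_kuten(encoded[3:5]))
```

On an EBCDIC box the byte values 0x21..0x7E no longer correspond to the
printable characters, which is where the scheme falls apart.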
This one may be tougher than we think....
FYI I know something called 12-bit EBCDIC kanji also exists. I only
know that it exists, but is it on our support list?
The test logs are attached: I would really appreciate it if you could
see some pattern in the failures.
I will do the best I can but I will be away for this weekend and I
won't be back online till Sunday at least.
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen
Dan the Unstable according to Jack Cohen