perl-unicode

Re: Encode UTF-8 optimizations

2016-08-24 23:49:46
On 08/22/2016 02:47 PM, pali(_at_)cpan(_dot_)org wrote:

snip

I added some tests for overlong sequences. Only for ASCII platforms, tests for
EBCDIC are missing (sorry, I do not have access to any EBCDIC platform for
testing).

It's fine to skip those tests on EBCDIC.
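
For readers of the archive, here is a minimal sketch of what such an overlong-sequence check could look like. It is not the test that was added to Encode; the helper name rejects() is hypothetical, and the ord('A') guard is just one common way to skip on EBCDIC.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns true if strict UTF-8 decoding refuses $bytes.
# LEAVE_SRC keeps decode() from modifying the input when a CHECK value is given.
sub rejects {
    my ($bytes) = @_;
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return !$ok;
}

if (ord('A') != 65) {
    # Not an ASCII platform: the byte sequences below are ASCII-specific.
    print "skipped: overlong checks are ASCII-only\n";
}
else {
    # "\xC0\xAF" is an overlong two-byte encoding of "/" (U+002F).
    print rejects("\xC0\xAF") ? "rejected\n" : "accepted\n";
    # "\xE0\x80\xAF" is an overlong three-byte encoding of the same character.
    print rejects("\xE0\x80\xAF") ? "rejected\n" : "accepted\n";
    # The shortest form is the single byte "/" and should decode normally.
    print rejects("/") ? "rejected\n" : "accepted\n";
}
```

The sketch uses the strict 'UTF-8' encoding name rather than the lax 'utf8' one, since the strict variant is the one expected to refuse malformed input.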

> > Anyway, how does it behave on EBCDIC platforms? And maybe another question:
> > what should Encode::encode('UTF-8', $str) do on EBCDIC? Encode $str to
> > UTF-8 or to UTF-EBCDIC?
>
> It works fine on EBCDIC platforms.  There are other bugs in Encode on
> EBCDIC that I plan on investigating as time permits.  Doing this has
> fixed some of these for free.  The uvuni() functions should in almost
> all instances be uvchr(), and my patch does that.
Now I'm wondering whether the FBCHAR_UTF8 define also works on EBCDIC... I
think that it should be different for UTF-EBCDIC.

I'll fix that

> On EBCDIC platforms, UTF-8 is defined to be UTF-EBCDIC (or vice versa if
> you prefer), so $str will effectively be in the version of UTF-EBCDIC
> valid for the platform it is running on (there are differences depending
> on the platform's underlying code page).
So does that mean that on EBCDIC platforms you cannot process a file which is
encoded in UTF-8? Encode::decode("UTF-8", $str) expects $str to be in
UTF-EBCDIC and not in UTF-8 (as I understand it).

Yes. The two worlds do not meet. If you are on an EBCDIC platform, the native encoding is UTF-EBCDIC tailored to the code page the platform runs on.

In searching, I did not find anything that converts between the two, so I wrote a Perl script to do so. Our OS/390 man, Yaroslav, wrote one in C.
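
To see which of the two worlds a given perl lives in, something like the following sketch may help. It is not the conversion script mentioned above; it only prints the bytes that Encode::encode('UTF-8', ...) produces for one character on the platform it runs on. The helper name hex_bytes() is hypothetical, and the ord('A') == 193 test is the usual EBCDIC check.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# 'A' is 193 (0xC1) in the EBCDIC code pages perl supports, 65 in ASCII.
my $platform = ord('A') == 193 ? 'EBCDIC' : 'ASCII';

# Encode the euro sign (U+20AC) with the platform's native "UTF-8".
my $bytes = encode('UTF-8', "\x{20AC}");

print "$platform: ", hex_bytes($bytes), "\n";
```

On an ASCII platform the bytes are the standard UTF-8 encoding of U+20AC. On an EBCDIC platform, per the explanation above, they should instead be the UTF-EBCDIC form for that platform's code page, so the output will differ.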