perl-unicode

Re: select a variable as stdout and utf8 flag behaviour

2016-11-09 09:20:24
On Wednesday 09 November 2016 15:55:47 Gert Brinkmann wrote:
Hello,

...

This prints out the utf8 characters corrupted. You have to flag the
Variable after writing into it with Encode::_utf8_on() as utf8 to make
it work correctly. (So activate the commented line.)

Using this _utf8_on() usually means that I am doing something wrong.

Yes, that is true! You should never use the _utf8_on/_utf8_off/is_utf8
functions! They are here *only* for dealing with buggy XS modules, not
for pure Perl code. In pure Perl code you must *not* care about the UTF8
flag.

Is there a better way to achieve the correct behaviour?

Of course! When you think that you need Encode::_utf8_on(), use
utf8::decode() instead (or Encode::decode('UTF-8', ...)). Similarly, use
utf8::encode() (or Encode::encode('UTF-8', ...)) instead of
Encode::_utf8_off().
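
To make the replacement concrete, here is a minimal sketch of decoding
and encoding without touching the flag. The variable names and the
sample octets are only illustrative; they are not taken from the
original script.

```perl
use strict;
use warnings;
use Encode ();

# Raw octets as they might arrive from a file or socket:
# the single character U+00DC encoded as UTF-8.
my $octets = "\xC3\x9C";

# Option 1: utf8::decode() converts in place and returns false
# if the octets are not well-formed.
my $in_place = $octets;
my $ok = utf8::decode($in_place);

# Option 2: Encode::decode() returns a new character string.
my $chars = Encode::decode('UTF-8', $octets);

# The reverse direction, used instead of _utf8_off():
# characters back to octets.
my $round_trip = Encode::encode('UTF-8', $chars);

printf "octets=%d chars=%d round_trip=%d\n",
    length $octets, length $chars, length $round_trip;
```

Two octets become one character and then two octets again; at no point
does the code look at or set the UTF8 flag.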

Btw. there was a change in the behaviour between perl v5.14.2 and
v5.20.2: In older perl versions you could do a

my $html = '';
Encode::_utf8_on($html);

before opening the file handle onto this variable. In newer perl
versions the utf8 flag is reset on open() and print() to the variable's
file handle.

The UTF8 flag only indicates whether the internal encoding of a Perl
scalar is Latin1 or UTF-8. But that is internal: any Latin1 string can
be represented either in Latin1 (without the UTF8 flag) or in UTF-8
(with the UTF8 flag). You should not care about the internal
representation in pure Perl code. Any Perl function can at any time
convert a scalar between these two encodings when that is possible (for
the ASCII and Latin1 charsets).
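
A small sketch of that point, assuming nothing beyond core Perl:
utf8::upgrade() and utf8::downgrade() switch the internal
representation, and the string stays the same string. utf8::is_utf8()
is called here only to peek at the internals for the demonstration,
which is exactly what ordinary code should not need to do.

```perl
use strict;
use warnings;

# "cafe" with the last letter as the single code point 0xE9.
# Every character fits in Latin1.
my $original = "caf\xE9";
my $copy     = $original;

# Force the UTF-8 internal representation (this sets the UTF8 flag).
utf8::upgrade($copy);
my $flag_up = utf8::is_utf8($copy) ? 1 : 0;
my $same_up = ($copy eq $original) ? 1 : 0;
my $len_up  = length $copy;

# Force the Latin1 representation again (this clears the flag).
utf8::downgrade($copy);
my $flag_down = utf8::is_utf8($copy) ? 1 : 0;
my $same_down = ($copy eq $original) ? 1 : 0;

print "upgraded:   flag=$flag_up equal=$same_up length=$len_up\n";
print "downgraded: flag=$flag_down equal=$same_down\n";
```

The flag changes, while eq and length() give the same answers in both
states.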

(Btw, on EBCDIC platforms the UTF8 flag indicates whether the internal
encoding is UTF-EBCDIC or EBCDIC, not UTF-8!! So really, do not depend
on the UTF8 flag!)

As for your question, here is an explanation of your source code:

-----------------------------------------------------
use strict;
use utf8;

Now the source code is expected to be in UTF-8, and string literals in
it are treated as wide characters.

use Encode;
use FileHandle;

binmode STDOUT, ":utf8";

Now printing to the STDOUT handle accepts wide characters (> 0xFF) and
converts the output to UTF-8 octets. So your terminal should be
configured to accept and show UTF-8 sequences correctly.


my $html = '';

#-- open filehandle to write into the $html variable as utf8
open(my $fh, '>:encoding(UTF-8)', \$html);

Now printing to $fh accepts wide characters and converts the printed
characters to UTF-8 octets before storing them in $html. It means that
$html will *always* contain a sequence of numbers (octets) which
represent UTF-8 sequences.

my $orig_stdout = select( $fh );


print "Ümläut Test ßaß; 使用下列语言\n";

Now you have a string with wide characters, and this print sends that
string to $html. In $html you have a sequence of octets which contains
the encoded form of that wide string.



select( $orig_stdout );
$fh->close();

#You need to activate this line to make utf8 output correct
#Encode::_utf8_on($html);

print $html;

And now you send a sequence of UTF-8 octets to STDOUT, which expects
wide characters and converts them to UTF-8 octets. So what you get is a
double-encoded UTF-8 sequence.

Now stop and think about why this is true!

-----------------------------------------------------
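
The double encoding can be reproduced without any filehandles. This is
only a sketch with an illustrative two-character string;
Encode::encode() stands in for what the two output layers do.

```perl
use strict;
use warnings;
use Encode ();

# One wide string: U+00DC followed by U+4F7F.
my $chars = "\x{DC}\x{4F7F}";

# What the ':encoding(UTF-8)' layer on $fh stores in $html:
# 2 octets for U+00DC plus 3 octets for U+4F7F.
my $once = Encode::encode('UTF-8', $chars);

# What the ':utf8' layer on STDOUT then does to those octets: each
# octet >= 0x80 is taken as a Latin1 character and encoded again,
# this time as two octets.
my $twice = Encode::encode('UTF-8', $once);

printf "characters=%d encoded_once=%d encoded_twice=%d\n",
    length $chars, length $once, length $twice;
# prints: characters=2 encoded_once=5 encoded_twice=10
```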

The fix is really simple. Either decode the UTF-8 octets in $html back
to wide characters (via utf8::decode($html)), or tell STDOUT that it
should not expect wide strings but raw octets (= remove the
binmode STDOUT, ":utf8"; line).

Again... think about why both of my proposed fixes work.
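
Here is a sketch that compares both fixes with the original bug. To
keep it checkable, in-memory handles stand in for STDOUT, and emitted()
is a hypothetical helper written for this example only; the sample text
is written with escapes so the file itself stays ASCII.

```perl
use strict;
use warnings;

# Wide characters, escaped so that this file needs no 'use utf8'.
my $text = "\x{DC}ml\x{E4}ut \x{4F7F}\x{7528}\n";

# Capture the text as UTF-8 octets, as the original script does
# with $html.
my $html = '';
open(my $fh, '>:encoding(UTF-8)', \$html) or die "open: $!";
print {$fh} $text;
close($fh) or die "close: $!";

# Hypothetical helper: which octets would a handle opened with
# $layer emit for $data? The in-memory handle stands in for STDOUT.
sub emitted {
    my ($layer, $data) = @_;
    my $out = '';
    open(my $h, ">$layer", \$out) or die "open: $!";
    print {$h} $data;
    close($h) or die "close: $!";
    return $out;
}

# Fix 1: decode $html to characters, keep the ':utf8' output layer.
my $decoded = $html;
utf8::decode($decoded) or die "not valid UTF-8";
my $fix1 = emitted(':utf8', $decoded);

# Fix 2: leave $html as octets, use an output handle that does not
# encode anything.
my $fix2 = emitted(':raw', $html);

# The original bug: octets sent through the ':utf8' layer.
my $bug = emitted(':utf8', $html);

printf "html=%d fix1=%d fix2=%d bug=%d octets\n",
    length $html, length $fix1, length $fix2, length $bug;
```

Both fixes emit exactly the octets stored in $html, while the buggy
combination emits a longer, double-encoded sequence.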



Btw, perl does not use the UTF-8 encoding, but perl's extended utf8. If
you want strict UTF-8, use the ":encoding(UTF-8)" layer. The layers
":utf8" and ":encoding(utf8)" (without the hyphen) are the non-strict,
perl-extended utf8 encodings. utf8::encode/decode are non-strict as
well...
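
The difference shows up with a code point that Unicode does not allow.
A minimal sketch, assuming a reasonably recent Encode; the choice of
0x110000 (one past the Unicode maximum) is just an example.

```perl
use strict;
use warnings;
use Encode ();

# Canonical names that Encode reports for the two spellings.
my $strict_name = Encode::find_encoding('UTF-8')->name;
my $lax_name    = Encode::find_encoding('utf8')->name;

# A code point just above the Unicode maximum of U+10FFFF.
my $beyond = chr(0x110000);

# Die on unencodable input, and leave the source string untouched.
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# perl's extended utf8 encodes the code point; strict UTF-8 refuses.
my $lax_ok    = eval { Encode::encode('utf8',  $beyond, $check); 1 } ? 1 : 0;
my $strict_ok = eval { Encode::encode('UTF-8', $beyond, $check); 1 } ? 1 : 0;

print "UTF-8 -> $strict_name (accepts 0x110000: $strict_ok)\n";
print "utf8  -> $lax_name (accepts 0x110000: $lax_ok)\n";
```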