perl-unicode

Re: select a variable as stdout and utf8 flag behaviour

2016-11-09 14:35:11
On Wednesday 09 November 2016 19:46:46 Gert Brinkmann wrote:
Pali, thank you very much for your answer. I am using the
Encode::decode('UTF-8', ...) function now instead of touching the
flag. Though I am not sure if a routine becomes better (more robust)
if it accepts utf8 instead of the stricter UTF-8. Or is it better if
it only accepts strict UTF-8?

'UTF-8' (with a hyphen) is strict UTF-8. 'utf8' or 'UTF8' (without a
hyphen) is Perl's own non-strict, extended utf8.

Which one to use depends on your needs. I would strongly suggest
strict UTF-8 for data exchange, that is, whenever you send data to or
receive data from the outside world.
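
A small sketch of the difference, assuming only the core Encode module.
The sample octets (an encoded surrogate, U+D800) are my own choice for
illustration; any sequence that is not valid for interchange would do.

```perl
use strict;
use warnings;
use Encode ();

# The two spellings resolve to different encoding objects.
my $strict_name = Encode::find_encoding('UTF-8')->name;
my $lax_name    = Encode::find_encoding('utf8')->name;
print "UTF-8 resolves to $strict_name\n";   # utf-8-strict
print "utf8  resolves to $lax_name\n";      # utf8

# Octets ED A0 80 spell the surrogate U+D800, which strict UTF-8 forbids.
my $octets = "\xED\xA0\x80";

# decode() with a CHECK argument may modify its input, so work on a copy.
my $copy = $octets;
my $strict_ok = eval {
    Encode::decode('UTF-8', $copy, Encode::FB_CROAK());
    1;
};
print 'strict decoder ', ($strict_ok ? 'accepted' : 'rejected'),
    " the input\n";
```

Passing FB_CROAK makes the strict decoder die on bad input instead of
silently substituting a replacement character, which is usually what you
want at the boundary to the outside world.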

On 09.11.2016 16:20, pali(_at_)cpan(_dot_)org wrote:
The fix is really simple. Either decode the UTF-8 octets in $html back
to wide characters (via utf8::decode($html)) or tell STDOUT that it
should expect raw octets instead of wide strings (= remove the binmode
STDOUT, ":utf8"; line).

Again... think about why both of my proposed fixes work.

I am close to understanding it. But I wonder why I have to think about
utf-8 in this case? I expected perl to do it right
automagically:

I open the filehandle to write into the variable using
:encoding(UTF-8). So perl should know what it is storing inside the
variable. If I print this to STDOUT (binmoded to utf-8) it should
automatically print the content of the variable the right way.

A string is just a sequence of characters, and a character is just a
number. In C (char), on disk, or in other storage a character is 8 bits
wide. In perl it can be up to 64 bits wide (on a 64-bit perl), and that
number represents a Unicode code point. So 0x100 is LATIN CAPITAL LETTER
A WITH MACRON, 0xFE is LATIN SMALL LETTER THORN, ...
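
To see the "sequence of numbers" view directly, here is a minimal sketch
that builds a string from the two code points mentioned above and prints
the numbers back. It uses chr() so that it does not depend on how the
source file is encoded.

```perl
use strict;
use warnings;

# Build a string from two code points, independent of the source encoding.
my $str = chr(0x100) . chr(0xFE);

my $length = length $str;   # counts characters (numbers), not octets
my @codes  = map { sprintf('U+%04X', ord $_) } split //, $str;

print "length: $length\n";   # length: 2
print "codes:  @codes\n";    # codes:  U+0100 U+00FE
```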

UTF-8 is a transformation that converts between a sequence of Unicode
code points and a sequence of 8-bit numbers. Encode::encode('UTF-8',
$str) takes the sequence of (wide, Unicode) numbers in $str, converts it
to UTF-8 and returns a sequence of 8-bit numbers. Since a string is just
a sequence of numbers, perl treats the returned scalar as a string too
(one which now has a different meaning).

The :encoding(UTF-8) and :utf8 layers just encode written data and
decode read data automatically, the same as if you called encode before
print or decode after read yourself.
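
A sketch of that equivalence, writing the same text once through a layer
and once through an explicit encode call. The variable names are mine,
and the in-memory filehandle stands in for any output handle.

```perl
use strict;
use warnings;
use Encode ();

my $text = "\x{DC}ml\x{E4}ut";   # "Ümläut" written as code points

# Implicit: the layer encodes on every print.
my $via_layer = '';
open(my $fh, '>:encoding(UTF-8)', \$via_layer) or die "open failed: $!";
print {$fh} $text;
close($fh) or die "close failed: $!";

# Explicit: encode by hand, as if printing to a raw handle afterwards.
my $by_hand = Encode::encode('UTF-8', $text);

print $via_layer eq $by_hand ? "same octets\n" : "different octets\n";
```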

If you look at your code again it can be rewritten as:

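  use strict;
  use utf8;
  use Encode;
  use FileHandle;
  my $html = '';
  open(my $fh, '>', \$html) or die "Cannot open in-memory file: $!";
  my $orig_stdout = select( $fh );
  # first encode: what the :encoding(UTF-8) layer on $fh did implicitly
  print Encode::encode('UTF-8', "Ümläut Test ßaß; 使用下列语言\n");
  select( $orig_stdout );
  $fh->close();
  # second encode: what binmode STDOUT, ":utf8" did implicitly
  print Encode::encode('UTF-8', $html);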

I just used explicit encode calls instead of the implicit ones (which
are hidden in the :utf8 or :encoding(UTF-8) layers).

Look at it again: you encoded the string two times! Encoding is done
when you write to a FH and decoding when you read from a FH.

The UTF-8 encoder takes a sequence of numbers (range 0x00..0x10FFFF
minus some disallowed ones) and returns another sequence of numbers
(range 0x00..0xFF). If you call it two times, you get something that is
encoded twice = garbage.
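
Here is a small sketch of what the second pass does, using a single
character (U+00E9, chosen only for illustration) and a hypothetical $hex
helper for printing octets.

```perl
use strict;
use warnings;
use Encode ();

my $char  = chr(0xE9);                        # LATIN SMALL LETTER E WITH ACUTE
my $once  = Encode::encode('UTF-8', $char);   # correct: one encoding pass
my $twice = Encode::encode('UTF-8', $once);   # bug: the octets are encoded again

# Helper for this example only: show each number in the string as hex.
my $hex = sub { join ' ', map { sprintf('%02x', ord $_) } split //, shift };

print 'once:  ', $hex->($once),  "\n";   # once:  c3 a9
print 'twice: ', $hex->($twice), "\n";   # twice: c3 83 c2 a9

# One decode undoes one encode, so decoding the garbage gives $once back.
my $repaired = Encode::decode('UTF-8', $twice);
print 'repaired equals once: ', ($repaired eq $once ? 'yes' : 'no'), "\n";   # yes
```

The second call cannot know that 0xC3 and 0xA9 are already UTF-8 octets.
It sees two characters in the range 0x80..0xFF and encodes each of them
as two octets.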

So why does it not know about the content being utf-8?

Because perl string scalars are always treated as a sequence of numbers,
and each number represents one (Unicode) character. A perl scalar does
not record anything about whether it is "raw" (e.g. a sequence of UTF-8
octets) or normal.

If I am using
"use utf8" and define an utf-8 data containing variable in the source
code, perl knows to handle this the correct way, too, without the
need to decode anything manually.

use utf8 tells perl that the source file is UTF-8, so string constants
become wide Unicode strings.

Take an example:

  use utf8;
  my $str = "使用下列语";

is equivalent to:

  my $str = "\x{4F7F}\x{7528}\x{4E0B}\x{5217}\x{8BED}";


Example without utf8:

  my $str = "使用";

is equivalent to:

  my $str = "\x{E4}\x{BD}\x{BF}\x{E7}\x{94}\x{A8}";

In this case the source file was parsed as an 8-bit file, so the string
contains different characters: one per UTF-8 octet.
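
Both cases can be checked in one script, since use utf8 is lexically
scoped. This sketch assumes the script itself is saved as UTF-8.

```perl
use strict;
use warnings;

my ($decoded, $octets);
{
    use utf8;    # the literal is decoded: one element per character
    $decoded = "使用";
}
{
    no utf8;     # the literal is taken octet by octet
    $octets = "使用";
}

print 'with use utf8:    ', length($decoded), "\n";   # 2
print 'without use utf8: ', length($octets),  "\n";   # 6
```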

Probably perl does not know about the content of the variable.

You can say that. It really does not know whether a variable contains a
sequence of ISO-8859-1 numbers, a sequence of UTF-8 numbers or Unicode
code points... It always treats a variable as a sequence of Unicode code
points. If you store something else in it, that is your
responsibility.

Only
the filehandle is set to write utf-8 data. The content of the
variable is only bytes, similar to a file that I am writing bytes
into.

But with binmode STDOUT, ":utf8"; you said that the data you are going
to write is *not* raw and that perl must first encode it to UTF-8. So
$html is expected to be not raw, yet it is raw because it was already
encoded once.
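
A sketch of the round trip done so that the text is encoded exactly
once. It is not your original program: $out is an in-memory handle that
stands in for STDOUT with an encoding layer, and the sample text is
shortened.

```perl
use strict;
use warnings;
use Encode ();

my $text = "\x{DC}ml\x{E4}ut Test";   # "Ümläut Test" written as code points

# Step 1: capture output in a variable. The layer encodes, so $html
# ends up holding UTF-8 octets.
my $html = '';
open(my $capture, '>:encoding(UTF-8)', \$html) or die "open failed: $!";
print {$capture} $text;
close($capture) or die "close failed: $!";

# Step 2: turn the octets back into characters before handing them to
# another handle that encodes on output.
my $chars = Encode::decode('UTF-8', $html);
my $final = '';
open(my $out, '>:encoding(UTF-8)', \$final) or die "open failed: $!";
print {$out} $chars;
close($out) or die "close failed: $!";

print 'encoded exactly once: ',
    ($final eq Encode::encode('UTF-8', $text) ? 'yes' : 'no'), "\n";   # yes
```

The other fix, leaving the output handle without an encoding layer and
printing $html as it is, gives the same octets.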

If I read the file again, I have to open it as utf-8.
Alternatively I guess I can open it as raw bytes and decode the data
afterwards to utf-8? The latter way would be the same as the
decoding of the variable content?

If you still do not see how it works, then forget about the existence
of perlio layers and write explicit encode/decode calls. Once you fully
understand how and where to call encode/decode, you can replace those
explicit calls with implicit ones via perlio layers.
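
For the reading direction, a sketch comparing the two ways you asked
about. An in-memory scalar stands in for the file on disk, and its
content (the octets for "ß" plus a newline) is just sample data.

```perl
use strict;
use warnings;
use Encode ();

# Stand-in for a UTF-8 file on disk: the octets for "ß" and a newline.
my $file_octets = "\xC3\x9F\n";

# Variant 1: let the layer decode while reading.
open(my $in_layer, '<:encoding(UTF-8)', \$file_octets) or die "open failed: $!";
my $via_layer = do { local $/; <$in_layer> };
close($in_layer);

# Variant 2: read raw octets, then decode explicitly.
open(my $in_raw, '<:raw', \$file_octets) or die "open failed: $!";
my $raw = do { local $/; <$in_raw> };
close($in_raw);
my $by_hand = Encode::decode('UTF-8', $raw);

print 'same characters:  ', ($via_layer eq $by_hand ? 'yes' : 'no'), "\n";   # yes
print 'first code point: ', sprintf('U+%04X', ord $by_hand), "\n";          # U+00DF
```

So yes, opening raw and decoding afterwards is the same operation as
decoding the content of the variable.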


... I hope this helps you ...