perl-unicode

Re: encoding(UTF16-LE) on Windows

2011-01-20 16:09:32
Jan Dubois wrote on 20.01.2011 at 12:45 (-0800):
On Thu, 20 Jan 2011, Michael Ludwig wrote:
Erland Sommarskog wrote on 20.01.2011 at 08:29 (-0000):
"Jan Dubois" (jand(_at_)activestate(_dot_)com) writes:
You need to stack the I/O layers in the right order.  The :encoding()
layer needs to come last (be at the bottom of the stack), *after* the
:crlf layer adds the additional carriage returns.  The way to pop the
default :crlf layer is to start out with the :raw pseudo-layer:

  open(my $fh, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;

Certainly not anywhere close to intuitive. And the explanation is even
more muddy. "Needs to come last" - it is smack in the middle. "after
the :crlf layer" - it comes before.

The explanation makes sense; so much so that I overlooked the fact that
this is simply not how it works. Luckily, you were being vigilant. :-)

Would you mind explaining how it is *not* working the way I
described it above?

Sorry - it works exactly the way you described above. I didn't read
properly. I got confused by the uniform look of real and pseudo layers.
The :raw pseudo layer is not a layer, but rather, as you write, an
instruction to clear the stack, like this:

  :raw                  -> clear()
  :encoding(UTF-16LE)   -> push( encoding(UTF-16LE) )
  :crlf                 -> push( crlf )

I was *wrongly* picturing it like this, as if :raw were another layer
and not a clearing instruction:

  :raw                  -> push( raw )                  # wrong!
  :encoding(UTF-16LE)   -> push( encoding(UTF-16LE) )
  :crlf                 -> push( crlf )

Regarding your explanation:

I realize that the fact that layers work as a "stack" may be
confusing, which is why I annotated "last" with "bottom of the stack".
Of course the one last on the stack is the first in the list of layers
passed to open() because stacks are LIFO (last in/first out):

   :raw                - clears the existing :crlf layer from the stack
                         (could have used :pop instead, but :raw is more
                         robust).

   :encoding(UTF-16LE) - pushes the :encoding layer to the stack.  This makes
                         it the last layer on the stack (and also still the
                         first, for now).

   :crlf               - pushes the :crlf layer on the stack.  :encoding is
                         still the last layer, but :crlf is now the first.

Now when you print a string to the filehandle, it is passed to the
top-most layer first (:crlf), which applies s/\n/\r\n/g to the string
and then passes it on to the next lower layer, :encoding, which does
the encoding. When the data reaches the bottom of the stack, it is
actually written to the filesystem.
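The data flow just described can be checked with a small sketch that writes "a\n" through this layer stack and dumps the bytes that end up on disk. The temporary file and the variable names are assumptions made for the illustration; because :raw clears the platform defaults first, the result should not depend on running on Windows.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, used only for this illustration.
my ($tmp, $filename) = tempfile(UNLINK => 1);
close $tmp;

# :crlf on top turns "\n" into "\r\n" first, then :encoding below it
# encodes the resulting characters as UTF-16LE.
open(my $out, ">:raw:encoding(UTF-16LE):crlf", $filename) or die $!;
print {$out} "a\n";
close $out or die $!;

# Read the file back as plain bytes, with no layers interfering.
open(my $in, "<:raw", $filename) or die $!;
my $bytes = do { local $/; <$in> };
close $in;

my $hex = join " ", map { sprintf "%02x", ord } split //, $bytes;
print "$hex\n";   # 61 00 0d 00 0a 00
```

Each of the three characters (a, CR, LF) becomes one two-byte UTF-16LE code unit, which is what a Windows reader of the file expects.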

Files opened on Windows already have the :crlf layer pushed by default,
so you somehow need to get the :encoding layer *below* it.  If
you have it on top, then the crlf substitution happens *after* the
encoding, leading to incorrect data.
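For contrast, here is a sketch of the wrong order. It pushes :crlf explicitly below :encoding to imitate, on any platform, what happens on Windows when :encoding is simply added on top of the default :crlf layer. The temporary file and variable names are assumptions made for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, used only for this illustration.
my ($tmp, $filename) = tempfile(UNLINK => 1);
close $tmp;

# Wrong order: :encoding sits on top, so the string is encoded first
# and :crlf then rewrites the single 0x0a byte inside the encoded data.
open(my $out, ">:raw:crlf:encoding(UTF-16LE)", $filename) or die $!;
print {$out} "a\n";
close $out or die $!;

# Read the file back as plain bytes.
open(my $in, "<:raw", $filename) or die $!;
my $bytes = do { local $/; <$in> };
close $in;

my $hex = join " ", map { sprintf "%02x", ord } split //, $bytes;
print "$hex\n";   # 61 00 0d 0a 00
```

The file now has an odd number of bytes: a lone 0x0d was inserted in front of the 0x0a, so the data is no longer valid UTF-16LE.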

I think you've clarified it for all eternity.

What would be the best place to add your explanation to the docs?

http://perldoc.perl.org/functions/binmode.html
http://perldoc.perl.org/functions/open.html
http://perldoc.perl.org/perlunicode.html
http://perldoc.perl.org/PerlIO.html

Judging from existing content, I think PerlIO would be a good place for
this addition. It already has a lot of great information. However, it
starts in medias res instead of first providing an overview and
introducing the stack picture. This could be improved.

On the downside, it is buried in the Modules Section. And the title [1]
is just too technical and might scare novice readers away.

Can you think of a better place for your user-friendly doc addition? You
obviously know the docs far better than I do … :-)

[1] PerlIO - On demand loader for PerlIO layers
    and root of PerlIO::* name space

-- 
Michael Ludwig