perl-unicode

Re: is it utf8 or unicode?

2005-03-16 03:36:11
On Wed, Mar 16, 2005 at 10:23:01AM +0000, 
unicode(_at_)ftumsh(_dot_)demon(_dot_)co(_dot_)uk wrote:

LANG is set to en_GB.
With some messing about I have managed to create an en_GB.utf8.
Setting LANG to that makes no difference to the perl output, nor does setting
LC_ALL.
Mind you, I should hope it wouldn't, as :raw ignores locale, apparently.

In a nutshell, the code below should put \xc3\x84 into the output file and
not \xc4 as it is doing. Well, I presume it should and no one is saying 
otherwise.

No, it shouldn't put the bytes \xc3\x84 into the file (except on perl 5.8.0
with a UTF-8 locale, or on 5.8.1 or later run with the correct -C flag to say
"pay attention to a UTF-8 locale"). 5.8.0's behaviour was documented, but
found to be undesirable.

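#!/usr/bin/perl -w
use strict;
use Encode qw(_utf8_on);

my $data = "\xC3\x84";
_utf8_on($data);                # treat the two bytes as one UTF-8 character
open my $fh, '>', 'aa' or die "Cannot open aa: $!";
print {$fh} $data;              # no encoding layer on this handle
close $fh or die "Cannot close aa: $!";
print length($data);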

As is, except for the cases noted above, the file handle is assumed to be
8-bit, not UTF-8. Perl 5 makes the assumption (arguably wrong, but we're stuck
with it now) that 8-bit file handles would like ISO-8859-1, and writes out
your characters as ISO-8859-1.
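
As a rough sketch of why the byte comes out as \xc4: after _utf8_on the
string holds one character, U+00C4, and the ISO-8859-1 form of that character
is the single byte \xc4. The variable names here are made up for illustration,
and encode() from Encode only stands in for what the 8-bit handle effectively
does on output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(_utf8_on encode);

my $data = "\xC3\x84";
my $length_before = length $data;      # two bytes, counted as two characters
_utf8_on($data);                       # reinterpret the same bytes as UTF-8
my $length_after  = length $data;      # now one character
my $code_point    = sprintf 'U+%04X', ord $data;

# The ISO-8859-1 encoding of that one character, dumped as hex.
my $latin1     = encode('iso-8859-1', $data);
my $latin1_hex = join ' ', map { sprintf '%02x', ord } split //, $latin1;

print "length before: $length_before, after: $length_after\n";
print "character: $code_point, as ISO-8859-1: $latin1_hex\n";
```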

If you do this

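#!/usr/bin/perl -w
use strict;
use Encode qw(_utf8_on);

my $data = "\xC3\x84";
_utf8_on($data);
open my $fh, '>', 'aa' or die "Cannot open aa: $!";
binmode $fh, ':utf8';           # mark the handle as expecting UTF-8
print {$fh} $data;
close $fh or die "Cannot close aa: $!";
print length($data);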

or this

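#!/usr/bin/perl -w
use strict;
use Encode qw(_utf8_on);

my $data = "\xC3\x84";
_utf8_on($data);
# Three-argument open with the layer given in the mode.
open my $fh, '>:utf8', 'aa' or die "Cannot open aa: $!";
print {$fh} $data;
close $fh or die "Cannot close aa: $!";
print length($data);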

to tell perl that the file handle expects UTF-8 rather than the default,
then you get a 2-byte output file.
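
One way to check the difference without touching the disk is to print through
in-memory file handles and dump what arrives. This is only a sketch:
hex_written is a hypothetical helper, not part of perl or Encode, and it
assumes no -C flag or PERL_UNICODE setting is changing the default layers.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(_utf8_on);

# Hypothetical helper: print $string through an in-memory handle opened
# with the given mode (e.g. '>' or '>:utf8') and return the bytes that
# were written, as space-separated hex.
sub hex_written {
    my ($mode, $string) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer or die "Cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "Cannot close in-memory handle: $!";
    return join ' ', map { sprintf '%02x', ord } split //, $buffer;
}

my $data = "\xC3\x84";
_utf8_on($data);

my $default = hex_written('>',      $data);
my $utf8    = hex_written('>:utf8', $data);

print "default layer: $default\n";   # c4
print ":utf8 layer:   $utf8\n";      # c3 84
```

The same check can be run against the real file by reading "aa" back through
a handle with binmode applied, so that no layer reinterprets the bytes.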

Nicholas Clark
