perl-unicode

Re: is it utf8 or unicode?

2005-03-16 08:18:24
At 8:03 pm +0000 9/3/05, unicode(_at_)ftumsh(_dot_)demon(_dot_)co(_dot_)uk 
wrote:

here's my perl -V

Summary of my perl5 (revision 5 version 8 subversion 6) configuration:

So ignore anything you've been told about previous versions.

Basically I have xC3 x84 and let perl think it is utf-8.
It is valid utf-8, i.e. A with diaeresis.

Yes.

I don't understand what the [UTF8 "\x{c4}"] is telling me. xc4 is not valid utf-8. It is however valid unicode as xc4 is a precomposed char. What's worse is that the output file contains xc4 and not the utf-8 sequence I expected.

The script below will result in two identical files, both containing the two bytes "\xC3" and "\x84". If you read them raw you will get two characters. If you read them as UTF-8 you will get a single character, A with diaeresis (U+00C4). If you read them as big-endian UCS-2 you will get the single character HANGUL SYLLABLE SSE (U+C384). How you read them and how you display them will make no difference to the content of the files.
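To make the three readings concrete without touching any files, here is a minimal sketch that decodes the same two bytes in memory. It assumes the core Encode module is available (it ships with perl 5.8); the variable names are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC3\x84";    # the same two bytes the script writes to disk

# Raw: no decoding, so each byte counts as one character.
my @raw = map { sprintf("U+%04X", ord($_)) } split(//, $bytes);

# UTF-8: the two bytes decode to the single character U+00C4.
my $utf8 = decode("UTF-8", $bytes);

# Big-endian UCS-2: the two bytes form one 16-bit unit, U+C384.
my $ucs2 = decode("UCS-2BE", $bytes);

printf "raw:   %d chars (%s)\n", scalar(@raw), join(" ", @raw);
printf "utf-8: %d char  (U+%04X)\n", length($utf8), ord($utf8);
printf "ucs-2: %d char  (U+%04X)\n", length($ucs2), ord($ucs2);
```

This should report two characters (U+00C3 U+0084) for the raw reading and one character each for the other two (U+00C4 and U+C384). In all three cases $bytes itself is unchanged; only the interpretation differs.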


#!/usr/bin/perl -w
use strict;
binmode STDOUT, ":utf8";   # then try omitting this
my $fin = "/tmp/in.txt";  my $fout = "/tmp/out.txt";
# Create a test file to read
open FIN, ">", $fin or die "Cannot write $fin: $!";
print FIN "\xC3\x84";  # write two bytes to $fin
close FIN;
# Get the text from $fin as raw bytes
open FIN, "<:raw", $fin or die "Cannot read $fin: $!";  # then try "<" in place of "<:raw"
my $text = <FIN>;
close FIN;
# Print $text (two bytes that are already UTF-8) to $fout, with no encoding layer
open FOUT, ">", $fout or die "Cannot write $fout: $!";
print FOUT $text;
close FOUT;
# Read $fout as UTF-8
open FOUT, "<:utf8", $fout or die "Cannot read $fout: $!";
$text = <FOUT>;
close FOUT;
print "YES, I AM \\x{00C4}\n" if $text eq "\x{00C4}";
print $text. "....", length $text, $/;
# Read $fout as raw bytes
open FOUT, "<:raw", $fout or die "Cannot read $fout: $!";
$text = <FOUT>;
close FOUT;
print $text. "....", length $text, $/;
# See what the system thinks
my $output =  `cat $fout`;
print $output, "....", length $output, $/;
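
On the complaint that the output file contained xc4 rather than the UTF-8 sequence: one possible cause, and it is only an assumption about the original script, is that a decoded character string was printed to a handle with no encoding layer. The sketch below writes the single character "\x{00C4}" through two different layers and reports the bytes that reach the file. bytes_written is a hypothetical helper written for this illustration, and File::Temp is used only to avoid fixed paths.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $string through $layer and
# return the resulting file content as space-separated hex bytes.
sub bytes_written {
    my ($layer, $string) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh, $layer or die "binmode failed: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";

    open my $in, "<:raw", $path or die "Cannot read $path: $!";
    local $/;                       # slurp the whole file
    my $raw = <$in>;
    close $in;
    return join " ", map { sprintf("%02X", ord($_)) } split(//, $raw);
}

my $char = "\x{00C4}";              # one character, A with diaeresis

my $no_layer   = bytes_written(":raw",  $char);
my $utf8_layer = bytes_written(":utf8", $char);

print "no encoding layer: $no_layer\n";
print ":utf8 layer:       $utf8_layer\n";
```

Without an encoding layer the character goes out as the single byte C4; with ":utf8" on the handle it goes out as C3 84. So if the data has been decoded on input, the output handle needs a matching layer to get the two-byte sequence back.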


