Re: is it utf8 or unicode?

At 8:03 pm +0000 9/3/05, unicode(_at_)ftumsh(_dot_)demon(_dot_)co(_dot_)uk 
wrote:

here's my perl -V

Summary of my perl5 (revision 5 version 8 subversion 6) configuration:


So ignore anything you've been told about previous versions.

Basically I have xC3 x84 and let perl think it is utf-8.
It is valid utf-8, i.e. A with diaeresis.


Yes.

I don't understand what the [UTF8 "\x{c4}"] is telling me. xc4 is not valid utf-8. It is however valid unicode as xc4 is a precomposed char. What's worse is that the output file contains xc4 and not the utf-8 sequence I expected.

The script below will result in two identical files, both containing two bytes, "\xC3" and "\x84". If you read them raw you will get two characters. If you read them as UTF-8 you will get a single character, A with diaeresis. If you read them as UCS-2 you will get the single character HANGUL SYLLABLE SSE. How you read them and how you display them will make no difference to the content of the files.
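
The three readings can also be checked in memory, without touching any files. This is only a minimal sketch, assuming the Encode module that ships with perl 5.8; the variable names are mine, and I have taken UCS-2 to mean big-endian ("UCS-2BE"). The full file-based script follows after it.

```perl
#!/usr/bin/perl -w
use strict;
use Encode qw(decode);

my $bytes = "\xC3\x84";    # the two bytes stored in the file

# Three ways of interpreting the same two bytes
my $as_raw  = $bytes;                     # no decoding at all
my $as_utf8 = decode("UTF-8",   $bytes);  # decoded as UTF-8
my $as_ucs2 = decode("UCS-2BE", $bytes);  # decoded as big-endian UCS-2 (assumption)

# Report length in characters and the code point of the first character
printf "raw:   length %d\n", length $as_raw;
printf "utf-8: length %d, U+%04X\n", length $as_utf8, ord $as_utf8;
printf "ucs-2: length %d, U+%04X\n", length $as_ucs2, ord $as_ucs2;
```

The bytes in $bytes are the same in all three cases; only the interpretation changes.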



#!/usr/bin/perl -w
use strict;
binmode STDOUT, ":utf8";   # then try omitting this
my $fin = "/tmp/in.txt";  my $fout = "/tmp/out.txt";
# Create a test file to read
open FIN, ">$fin" or die $!;
print FIN "\xC3\x84";  # write two bytes to $fin
close FIN;
# Get the text from $fin
open FIN, "<:raw", $fin or die $!;  # then try omitting the  ' "<:raw", '
my $text = <FIN>;
close FIN;
# Print $text utf-8 encoded to $fout
open FOUT, ">$fout" or die $!;
print FOUT $text;
close FOUT;
# Read $fout as UTF-8
open FOUT, "<:utf8", $fout or die $!;
$text = <FOUT>;
close FOUT;
print "YES, I AM \\x{00C4}\n" if $text eq "\x{00C4}";
print $text. "....", length $text, $/;
# Read $fout as raw bytes
open FOUT, "<:raw", $fout or die $!;
$text = <FOUT>;
close FOUT;
print $text. "....", length $text, $/;
# See what the system thinks
my $output =  `cat $fout`;
print $output, "....", length $output, $/;
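
As for the output file that contained xc4 rather than the UTF-8 sequence: that is what happens when a character string is printed to a handle that has no encoding layer. The sketch below is my own illustration, not part of the script above. It uses in-memory filehandles (available from 5.8) so that nothing is written to disk, and the names $plain and $encoded are mine.

```perl
#!/usr/bin/perl -w
use strict;

my $char = "\x{C4}";    # one character, A with diaeresis

# Handle with no encoding layer: the character is written as the single byte C4
my $plain = '';
open my $out1, ">", \$plain or die $!;
print $out1 $char;
close $out1;

# Handle with the :utf8 layer: the character is written as the two bytes C3 84
my $encoded = '';
open my $out2, ">:utf8", \$encoded or die $!;
print $out2 $char;
close $out2;

# Show the bytes that each handle actually received, in hex
print "no layer: ", uc unpack("H*", $plain),   "\n";
print ":utf8:    ", uc unpack("H*", $encoded), "\n";
```

If the output handle is meant to carry UTF-8, open it with ">:utf8" (or binmode it) before printing character data to it.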