At 1:12 am +0200 26/10/03, Marco Baroni wrote:
> I am new to (explicit) unicode handling, and right now I am facing
> this problem.
> I have some data (lots of data) that in theory should be in ascii
> (with entity references in place of non-ascii characters). I have
> no easy way to get to know exactly how these data were generated.
Presumably you have some idea what OS the files were created on. If
they are MacRoman files, whether or not they also happen to be plain
us-ascii, then you might try something like the script below. The
first part of the script simply creates a sample file for testing
purposes:
#!/usr/bin/perl -w
# Part 1: write some MacRoman sample text to /tmp/some.txt
# (the script itself is assumed to be saved as MacRoman, so this
# literal goes out as MacRoman bytes).
my $text = "/tmp/some.txt";
open TEXT, ">", $text or die "can't write $text: $!";
print TEXT 'œ∑鮆¥üîøπ';
##### `open -a 'SimpleText' $text`; # if you like
close TEXT;

# Part 2: re-read the file as MacRoman and write it out as utf-8 HTML.
# The encoding pragma also makes literals from here on (e.g. the ∑ in
# the substitution below) be read as MacRoman.
use encoding "MacRoman", STDOUT => "utf8";
my $top = q(<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>Some chars</title>);
my $html = "/tmp/some.html";
open HTML, ">:encoding(utf8)", $html or die "can't write $html: $!";
print HTML $top;
# copy the contents of some.txt into the html file as utf-8
open TEXT, "<:encoding(MacRoman)", $text or die "can't read $text: $!";
while (<TEXT>) {
    s~∑~S~g;       # example substitution on the decoded text
    print HTML;
}
close TEXT;
close HTML;        # flush before handing the file to Safari
`open -a 'Safari' $html`;
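
As for telling whether the data really are plain us-ascii in the first
place: a quick scan along these lines (just a sketch of mine, not part
of the script above; the file name and the MacRoman guess are only
examples) reports any non-ascii bytes, and the second loop shows one
way of replacing such characters with numeric character references
once you do know the encoding:

#!/usr/bin/perl -w
# Rough check: is the file really plain us-ascii?  Report any
# non-ascii bytes found on each line.
use strict;

my $file = shift || "/tmp/some.txt";

open IN, "<", $file or die "can't read $file: $!";
binmode IN;                       # look at raw bytes, no decoding
my $n = 0;
while (<IN>) {
    $n++;
    if (my @bad = /([\x80-\xFF])/g) {
        printf "line %d: non-ascii byte(s) %s\n",
               $n, join " ", map { sprintf "0x%02X", ord } @bad;
    }
}
close IN;

# If the bytes turn out to be MacRoman, decode them and replace each
# non-ascii character with a numeric character reference (the &#...;
# values are Unicode code points, not raw byte values).
open IN, "<:encoding(MacRoman)", $file or die "can't read $file: $!";
while (<IN>) {
    s/([^\x00-\x7F])/sprintf "&#%d;", ord $1/ge;
    print;
}
close IN;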