perl-unicode

Re: Malformed UTF-8 character

2003-10-26 04:30:05
At 1:12 am +0200 26/10/03, Marco Baroni wrote:

> I am new to (explicit) Unicode handling, and right now I am facing this problem.
>
> I have some data (lots of data) that in theory should be in ASCII (with entity references in place of non-ASCII characters). I have no easy way to find out exactly how these data were generated.

Presumably you have some idea what OS the files were created on. If they are MacRoman files, whether or not they stay within US-ASCII, then you might try something like this. The first part of the script simply creates a sample file for testing purposes:


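#!/usr/bin/perl -w
# NOTE: this script is assumed to be saved as MacRoman, so the
# non-ASCII literals below are MacRoman bytes.
#
# write some MacRoman to file some.txt
my $text = "/tmp/some.txt" ;
open TEXT, ">$text" or die "Cannot write $text: $!" ;
print TEXT 'œ∑鮆¥üîøπ' ;
#####  `open -a 'SimpleText' $text` ; # if you like
close TEXT;
#
#
# from here on, literals in this source are treated as MacRoman
use encoding "MacRoman", STDOUT => "utf8";

my $top = q(<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>Some chars</title>);

my $html = "/tmp/some.html";
open HTML, ">:encoding(utf8)", $html or die "Cannot write $html: $!" ;
print HTML $top;
# write contents of some.txt to html file as utf-8
open TEXT, "<:encoding(MacRoman)", $text or die "Cannot read $text: $!" ;
while (<TEXT>) {          # read line by line instead of slurping
        s~∑~S~g ;         # sample edit: replace the summation sign
        print HTML;
}
close TEXT;
close HTML;               # make sure everything is on disk first
`open -a 'Safari' $html` ;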
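
If you would rather not depend on the encoding pragma (which later Perl releases deprecated), the same idea can be sketched with the Encode module alone. This is only a rough sketch under the same assumption as above, namely that the data really are MacRoman. The helper names non_ascii_lines and macroman_to_utf8 and the sample string are made up for illustration. The first helper reports which lines contain bytes outside 7-bit ASCII, which may help you see how far the data depart from "in theory ASCII"; the second converts a MacRoman byte string to UTF-8 bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return the numbers of the lines in $bytes
# that contain at least one byte outside the 7-bit ASCII range.
sub non_ascii_lines {
    my ($bytes) = @_;
    my @hits;
    my $n = 0;
    for my $line (split /\n/, $bytes, -1) {
        $n++;
        push @hits, $n if $line =~ /[^\x00-\x7F]/;
    }
    return @hits;
}

# Hypothetical helper: treat $bytes as MacRoman (an assumption about
# the data) and return the same text as UTF-8 bytes.
sub macroman_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('MacRoman', $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);           # characters -> UTF-8 bytes
}

# Made-up sample: line 2 holds the MacRoman bytes 0xCF (oe ligature)
# and 0xB7 (summation sign); the other lines are plain ASCII.
my $sample = "plain ascii\n\xCF\xB7\nmore ascii\n";
my @bad    = non_ascii_lines($sample);
print "non-ASCII on line(s): @bad\n";

binmode STDOUT;                     # pass the converted bytes through as-is
print macroman_to_utf8($sample);
```

Running non_ascii_lines over the real files first should tell you whether a conversion is needed at all. If the guess about MacRoman is wrong, the converted output will show the wrong characters rather than raise an error, so check a few known words by eye.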
