perl-unicode

Re: List of unsupported unicode characters?

2007-01-10 14:46:55
Hello

U+00A0 is a Unicode code point, not a UTF-8 byte sequence. The UTF-8 encoding
of U+00A0 is the two bytes C2 A0. What's interesting here is that A0 is part
of that UTF-8 sequence. So if the file is UTF-8, perl is not seeing the other
byte of the sequence (the leading C2). Otherwise the file might not be UTF-8
at all.
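
To make the byte-level point concrete, here is a minimal sketch using the core
Encode module. It only illustrates the encodings discussed above; the variable
names are made up for this example and nothing here is taken from the original
poster's program.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00A0 (NO-BREAK SPACE) as a one-character Perl string.
my $nbsp = "\x{A0}";

# Hex dump of the character encoded as UTF-8 (two bytes: C2 A0).
my $utf8_hex = join ' ', map { sprintf '%02X', ord } split //, encode('UTF-8', $nbsp);

# Hex dump of the same character encoded as ISO-8859-1 (one byte: A0).
my $latin1_hex = join ' ', map { sprintf '%02X', ord } split //, encode('ISO-8859-1', $nbsp);

# A lone A0 byte is a continuation byte without its lead byte,
# so a strict UTF-8 decode refuses it.
my $lone    = "\xA0";
my $lone_ok = eval { decode('UTF-8', $lone, Encode::FB_CROAK); 1 } ? 1 : 0;

print "UTF-8:   $utf8_hex\n";
print "Latin-1: $latin1_hex\n";
print "lone A0 decodes as UTF-8: ", ($lone_ok ? "yes" : "no"), "\n";
```

A single A0 byte in the file is therefore what one would expect from Latin-1
output, not from UTF-8 output.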

Regards,

Oliver

On Wednesday, 10 January 2007 19:59, John Costello wrote:
On Wed, 10 Jan 2007, Paul Bijnens wrote:
On 2007-01-10 08:10, John Costello wrote:
Is there a list of utf8 characters that perl cannot map, for example
"\xA0"?  This is with Perl 5.8.3.

AFAIK there is no problem with "\xA0" if you mean the "\xA0" in
latin1 (iso8819-1) or similar encodings.  That is just the "no-break
space".

Yes, that is the character I mean, though it is ISO-8859 (I seem to recall
that one is a subset of the other).

What exactly is your problem with that character?

perl 5.8.3 complains

      utf8 "\xA0" does not map to Unicode

when the file is read.  I'm specifying open(INFILE,
"<:encoding($this->{'encoding'})", $this->{filename}), where
$this->{'encoding'} is set to utf8 (confirmed that).

The file originally was generated by perl 5.6.1 with utf encoding
specified via binmode.  The file then was tarred, gzipped, scp'd, and
ungzipped and untarred and fed to perl 5.8.3.

Thanks to Darren for the pointer to perldelta and the Unicode versions.  I
see that Unicode 4.0.0 does support \xA0, as well as the 110 other
characters that perl 5.8.3 complains about.

If I drop the encoding statement and change the open command to
      open(INFILE, "<$this->{'filename'}")

the errors disappear.
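
Without an encoding layer perl reads the bytes as they are and validates
nothing, which would explain why the errors go away. A rough sketch of how one
might locate the offending lines, assuming a sample file built on the fly (the
temporary file and its contents are invented for illustration and stand in for
the real data):

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Build a small sample file: line 1 is valid UTF-8, line 2 has a lone A0 byte.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                                  # write bytes exactly as given
print {$out} "ok \xC2\xA0 here\n", "bad \xA0 here\n";
close $out or die "close: $!";

# Read without an encoding layer, then check each line ourselves.
open my $in, '<:raw', $path or die "open: $!";
my @bad_lines;
while (my $line = <$in>) {
    my $copy = $line;                          # FB_CROAK may modify its input
    push @bad_lines, $.
        unless eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
}
close $in;

print "lines with malformed UTF-8: @bad_lines\n";
```

Pointing the same loop at the real file should show whether the 111 complaints
come from isolated high bytes like A0.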

..

This leads me to think that perl 5.6.1 isn't encoding the output into
utf8, but that's a bit of a wild guess.
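
If that guess is right, one possible workaround is to try a strict UTF-8
decode first and fall back to ISO-8859-1 when it fails. The helper below is
hypothetical (its name and the fallback choice are assumptions for this
sketch, not something from the thread), and it only suits data where Latin-1
is a plausible alternative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns (encoding_name, decoded_text) for a byte string.
sub decode_utf8_or_latin1 {
    my ($bytes) = @_;
    my $copy = $bytes;    # FB_CROAK may modify its input, so decode a copy
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return ('UTF-8', $text) if defined $text;
    # Every byte string is valid ISO-8859-1, so this fallback cannot fail.
    return ('ISO-8859-1', decode('ISO-8859-1', $bytes));
}

my ($enc_a, $text_a) = decode_utf8_or_latin1("caf\xC3\xA9");   # well-formed UTF-8
my ($enc_b, $text_b) = decode_utf8_or_latin1("a\xA0b");        # lone A0 byte
print "$enc_a\n$enc_b\n";
```

Because any byte sequence decodes as ISO-8859-1, the fallback never reports an
error, so it hides real corruption as easily as it fixes mislabelled files.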