Re: PS (Malformed UTF-8 character)


On Sunday, Oct 26, 2003, at 01:12 Europe/Rome, Marco Baroni wrote:

I have some data (lots of data) that in theory should be in ascii  
(with entity references in place of non-ascii characters). I have no  
easy way to get to know exactly how these data were generated.

[snip]

I looked at a few of the corresponding lines, and they all have some  
character that is beyond the ASCII range, and that was not converted  
into an entity reference (for example, a c with cedilla, and the like).


Looks like your theory about the input data being "in ascii (with entity 
references...)" is contradicted by the evidence.

So now you need to determine what character encoding is being used for
the non-ascii codes, which are obviously present in the data.  When you
look at the file and you see a c with cedilla, can you tell whether this
is actually the appropriate character, based on its context?  Is this
true of all such characters?
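
One way to get an overview is to tally which non-ascii byte values
actually occur, so you can compare them against candidate code pages.
Here is a rough sketch of that idea; the sample lines are made up for
illustration and stand in for lines read raw from your file:

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for raw (undecoded) input data.
my @lines = ("plain ascii\n", "gar\xE7on\n", "na\xEFve caf\xE9\n");

# Count every byte that falls outside the ASCII range (0x80-0xFF).
my %seen;
for my $line (@lines) {
    $seen{$1}++ while $line =~ /([\x80-\xFF])/g;
}

# Report each byte value in hex together with how often it occurred.
my @report = map { sprintf("0x%02X: %d", ord($_), $seen{$_}) } sort keys %seen;
print "$_\n" for @report;
```

If the same few byte values keep showing up where accented letters
belong, that narrows down which single-byte encoding is likely in use.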

If so, figure out what code page is being used by the tool that is
showing you the c with cedilla, and use either PerlIO::encoding (that
is, an ":encoding(...)" layer) when you open the input file (to decode
the file's character set into utf8 as it is being read) or the "decode"
function of the Encode module (to do the codepage-to-utf8 conversion
after reading the raw data from the file).
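
A minimal sketch of both approaches follows.  It assumes, purely for
illustration, that the data turns out to be ISO-8859-1; the file name
"sample.txt" and its contents are also made up, so substitute whatever
encoding and file you actually have:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: the encoding is ISO-8859-1; replace with the real one.
my $encoding = 'iso-8859-1';
my $file     = 'sample.txt';    # hypothetical file name

# Write a small sample file containing a raw 0xE7 byte
# (c with cedilla in ISO-8859-1), so the example is self-contained.
open(my $out, '>:raw', $file) or die "Cannot write $file: $!";
print {$out} "gar\xE7on\n";
close($out);

# Approach 1: let an :encoding() layer decode the data as it is read.
open(my $in, "<:encoding($encoding)", $file) or die "Cannot read $file: $!";
my $line_layer = <$in>;
close($in);

# Approach 2: read the raw bytes, then decode them with Encode::decode.
open(my $raw, '<:raw', $file) or die "Cannot read $file: $!";
my $bytes = <$raw>;
close($raw);
my $line_decoded = decode($encoding, $bytes);

# Both variables now hold Perl character strings; encode them as UTF-8
# on the way out.
binmode(STDOUT, ':encoding(UTF-8)');
print $line_layer, $line_decoded;
```

The layer is convenient when the whole file is in one encoding; calling
decode yourself gives you a chance to inspect the raw bytes first.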

        Dave Graff