Re: PS (Malformed UTF-8 character)

On Sunday 26 October 2003 01:27 am, Marco Baroni wrote:

Thanks for your quick reply!

When you look at the file and you see
a c with cedilla, can you tell whether this is actually the
appropriate character, based on its context?  Is this true
of all such characters?


I do not see a c with cedilla, I see a rhombus with a question
mark inside (which is the way my shell displays non-ASCII
characters). I guess it is a c with cedilla from the context.


Which one? In ISO 8859-1, that would be

C7 uppercase C with cedilla
E7 lowercase c with cedilla

but in MacRoman it would be something else, and there are other 
possibilities.
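
To make the difference concrete, here is a minimal Python sketch that 
decodes the two byte values above under a few 8-bit encodings. The 
choice of candidates and the helper name interpretations are my own 
assumptions for illustration; the codec names are those shipped with 
Python.

```python
# Candidate 8-bit encodings to compare (an illustrative selection, not exhaustive).
CANDIDATES = ("latin-1", "cp1252", "mac_roman")

def interpretations(byte_value):
    """Map each candidate codec name to the character it assigns to one byte."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in CANDIDATES}

# Show how the two ISO 8859-1 c-cedilla codes read under each candidate.
for value in (0xC7, 0xE7):
    print(f"{value:02X}", interpretations(value))
```

ISO 8859-1 and Windows-1252 agree on both bytes, while MacRoman 
assigns them to different characters and keeps its c with cedilla 
at other positions.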

Lowercase is far more common (in French, for example), but I make 
no assumptions about the language of the text.

So, I would like to ask you or anybody else: is there some
kind of tool (e.g., a text editor) that I could use to
discover which encoding is being used? (I tried with emacs but
failed).


I don't have specific links, but this has been a topic of 
discussion on the Unicode mailing list. There is software that 
uses various heuristics to identify the character set and 
encoding of text files and streams. It doesn't distinguish the 
various 8-bit character sets, so I don't think it would help 
you.

In simple cases like this, however, a hex editor is probably 
sufficient. There are many that show the value of each byte in a 
file along with one or more possible interpretations (binary, 
octal, and as a character or number of one or another length in 
either little-endian or big-endian order). On Linux I use 
Khexedit. There are numerous such editors for Mac and Windows as 
well, including those in the Norton Utilities.
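
If no hex editor is at hand, a few lines of Python can do roughly 
the same job for this purpose. This is only a sketch: the function 
name non_ascii_bytes, the context width, and the sample bytes are 
all made up for illustration.

```python
def non_ascii_bytes(data, context=8):
    """Return (offset, hex value, surrounding bytes) for every byte >= 0x80."""
    hits = []
    for offset, value in enumerate(data):
        if value >= 0x80:
            start = max(0, offset - context)
            # Keep a few bytes on each side so the word can be guessed from context.
            hits.append((offset, f"{value:02X}", data[start:offset + context + 1]))
    return hits

# Hypothetical sample: the word "garcon" with byte E7 where the cedilla belongs.
sample = b"gar\xe7on"
for offset, hex_value, around in non_ascii_bytes(sample):
    print(offset, hex_value, around)
```

For a real file, read it in binary mode, for example with 
open(path, "rb").read(), and pass the result in place of sample.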

The most likely case is that your file is in ISO 8859-1 or one of 
Microsoft's Windows code page extensions, both using the codes 
given above.
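
A rough way to check that guess, sketched in Python under my own 
assumptions: try a strict UTF-8 decode first, then look for bytes 
in the range 80 to 9F, which are control codes in ISO 8859-1 and so 
rarely appear in plain text in that encoding. The function name 
rough_guess and its return labels are invented here, and the result 
is a hint, not a proof.

```python
def rough_guess(data):
    """Classify bytes as 'ascii', 'utf-8', 'not-latin-1' or 'latin-1-like'."""
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")  # strict decoding raises on malformed sequences
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 0x80-0x9F are C1 control codes in ISO 8859-1; their presence suggests
    # some other 8-bit encoding (a Windows code page, MacRoman, ...).
    if any(0x80 <= b <= 0x9F for b in data):
        return "not-latin-1"
    return "latin-1-like"

print(rough_guess(b"gar\xe7on"))
```

A result of "latin-1-like" is consistent with ISO 8859-1 or a 
Windows code page, but it cannot tell the 8-bit sets apart by 
itself; context is still needed for that.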

Thanks again.


You're welcome.

Marco



---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni


-- 
Edward Cherlin, Simputer Evangelist
Encore Technologies (S) Pte. Ltd.
Computers for all of us
http://www.simputerland.com, http://cherlin.blogspot.com