On Sunday 26 October 2003 01:27 am, Marco Baroni wrote:
Thanks for your quick reply!
When you look at the file and you see
a c with cedilla, can you tell whether this is actually the
appropriate character, based on its context? Is this true
of all such characters?
I do not see a c with cedilla, I see a rhombus with a question
mark inside (which is the way my shell displays non-ASCII
characters). I guess it is a c with cedilla from the context.
Which one? In ISO 8859-1, that would be
C7 uppercase C-cedilla
E7 lowercase c-cedilla
but in MacRoman it would be something else, and there are other
possibilities.
Lowercase is far more common (in French, for example), but I make
no assumptions about the language of the text.
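To make the byte values above concrete, here is a small Python sketch (my own illustration, not something from the original file) that encodes both cedilla characters in a few encodings. The list of candidates is an assumption chosen for the example; UTF-8 is included only as one of the "other possibilities".

```python
# Candidate encodings are chosen for illustration; the list is not exhaustive.
CANDIDATES = ["iso-8859-1", "cp1252", "mac_roman", "utf-8"]

def cedilla_bytes(chars="\u00c7\u00e7", encodings=CANDIDATES):
    """Map each encoding name to {character: hex bytes}.

    The default characters are U+00C7 (uppercase C-cedilla) and
    U+00E7 (lowercase c-cedilla), written as escapes so the script
    does not depend on its own source encoding.
    """
    return {
        enc: {ch: ch.encode(enc).hex().upper() for ch in chars}
        for enc in encodings
    }

for enc, codes in cedilla_bytes().items():
    print(enc, codes)
```

ISO 8859-1 and Windows code page 1252 both give C7 and E7, while MacRoman gives 82 and 8D, so the byte value alone already narrows things down.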
So, I would like to ask you or anybody else: is there some
kind of tool (e.g., a text editor) that I could use to
discover which encoding is being used? (I tried with emacs but
failed).
I don't have specific links, but this has been a topic of
discussion on the Unicode mailing list. There is software that
uses various heuristics to identify the character set and
encoding of text files and streams. It doesn't distinguish the
various 8-bit character sets, so I don't think it would help
you.
In simple cases like this, however, a hex editor is probably
sufficient. There are many that show the value of each byte in a
file along with one or more possible interpretations (binary,
octal, and as a character or number of one or another length in
either little-endian or big-endian order). On Linux I use
Khexedit. There are numerous such editors for Mac and Windows as
well, including those in the Norton Utilities.
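If no hex editor is at hand, a few lines of Python can do a rough version of the same job. This is only a sketch: the function name `non_ascii_bytes` and the sample bytes are made up for illustration, and it simply lists every byte above 7F with its offset and some surrounding context.

```python
def non_ascii_bytes(data, context=8):
    """Yield (offset, hex value, context) for each byte >= 0x80 in data.

    The context is the surrounding bytes shown as ASCII, with '.'
    standing in for anything that is not printable ASCII.
    """
    for offset, value in enumerate(data):
        if value >= 0x80:
            start = max(0, offset - context)
            window = data[start:offset + context + 1]
            text = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in window)
            yield offset, f"{value:02X}", text

# Hypothetical sample; for a real file, read it in binary mode instead,
# e.g. data = open(path, "rb").read() with your own path.
sample = b"gar\xe7on"
for offset, hex_value, text in non_ascii_bytes(sample):
    print(offset, hex_value, text)
```

Reading the file in binary mode matters here, because a text-mode read would already apply some decoding and hide the raw byte values.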
The most likely case is that your file is in ISO 8859-1 or one of
Microsoft's Windows code page extensions, both of which use the
codes given above.
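One way to check that guess is to decode a suspicious stretch of bytes under each candidate and see which result reads sensibly in context. The sketch below assumes the same illustrative candidate list as before and a made-up sample; it does not detect anything by itself, it only shows the alternatives for a human to judge.

```python
# Candidate encodings are an assumption for this example, not a complete list.
CANDIDATES = ["iso-8859-1", "cp1252", "mac_roman", "utf-8"]

def decode_candidates(data, encodings=CANDIDATES):
    """Decode the same bytes under each encoding.

    Bytes that are invalid in a given encoding are replaced with
    U+FFFD instead of raising an error.
    """
    return {enc: data.decode(enc, errors="replace") for enc in encodings}

sample = b"gar\xe7on"  # hypothetical bytes copied from a hex dump
for enc, text in decode_candidates(sample).items():
    print(f"{enc:12} {text!r}")
```

If the ISO 8859-1 and Windows readings give a plausible word and the others do not, that is reasonable evidence for one of those two.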
Thanks again.
You're welcome.
Marco
---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni
--
Edward Cherlin, Simputer Evangelist
Encore Technologies (S) Pte. Ltd.
Computers for all of us
http://www.simputerland.com, http://cherlin.blogspot.com