perl-unicode

Re: PS (Malformed UTF-8 character)

2003-10-26 17:30:04

baroni(_at_)sslmit(_dot_)unibo(_dot_)it said:
I see a rhombus with a question mark inside (which is the way my
shell displays non-ASCII characters). I guess it is a c with cedilla
from the context.

So, I would like to ask you or anybody else: is there some kind of
tool (e.g., a text editor) that I could use to discover which
encoding is being used?

The first thing to do is get a hexadecimal dump of the data, to see what
the actual byte sequence is.  The unix "od" utility is good for this
(e.g. "od -t x1"), and I think emacs has a mode for viewing data in hex.
Once you see what byte codes are being used to represent a c-cedilla
(and/or other non-ascii characters that are clearly inferable from
context), you can scan through the various cross-mapping code tables
that are available for inspection or download at unicode.org
(http://www.unicode.org/Public/MAPPINGS/).
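
If "od" is not handy, a few lines of Python can produce a similar dump.
This is only a rough sketch: the function name hex_dump and the sample
bytes are made up for illustration, and the sample assumes a file in
which the c-cedilla happens to be stored as the single byte 0xE7.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return an od-style dump: offset, hex byte values, printable ASCII."""
    lines = []
    pad = width * 3 - 1  # width of the hex column when a row is full
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Anything outside printable ASCII is shown as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part.ljust(pad)}  {text_part}")
    return "\n".join(lines)


# Hypothetical sample: "garcon" with one non-ASCII byte where the c is.
sample = b"gar\xe7on"
print(hex_dump(sample))
```

For a real file you would read the bytes in binary mode, e.g.
open(path, "rb").read(), and look for the rows where dots appear in the
right-hand column.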

As a clue, if you see a two-byte sequence for each accented character,
whereas the plain-ascii characters are all single-byte, then the data
is probably in utf8.  Another clue is that the first byte of these
two-byte sequences tends to be the same within a given language: for
the accented letters of Western European languages it is usually 0xC3,
so a c-cedilla (U+00E7) shows up as the two bytes C3 A7.
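
The utf8 check can also be done mechanically, since most non-utf8 data
fails to decode as utf8.  The following is a minimal sketch in Python;
the helper name utf8_sequences and the two sample byte strings are
assumptions for illustration, not anything taken from the original data.

```python
def utf8_sequences(data: bytes):
    """Return [(char, hex bytes)] for each non-ASCII character if data
    decodes as UTF-8, or None if it is not valid UTF-8."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    result = []
    for ch in text:
        if ord(ch) > 127:
            encoded = ch.encode("utf-8")
            result.append((ch, " ".join(f"{b:02x}" for b in encoded)))
    return result


# Hypothetical samples: the same word stored two different ways.
print(utf8_sequences(b"gar\xc3\xa7on"))  # two bytes for the c-cedilla
print(utf8_sequences(b"gar\xe7on"))      # a single byte 0xE7
```

A None result means the data is not utf8, and the single-byte tables
are the next place to look.  Note that a pure-ASCII file decodes fine
and gives an empty list, which tells you nothing either way.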

On the other hand, if all characters appear to be single-byte, you'll
need to look for the name of the inferred character (e.g. "LATIN SMALL
LETTER C WITH CEDILLA") in the various cross-mapping tables, and
determine which table has the appropriate byte code that matches your
data for this character.
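
The table lookup can be approximated with the codecs that ship with
Python, which cover many of the same legacy encodings as the unicode.org
mapping files.  This is a sketch under assumptions: the shortlist in
CANDIDATES and the function name encodings_for are my own choices, and
you would extend the list with whatever encodings are plausible for
your data.

```python
import unicodedata

# Assumed shortlist of single-byte encodings to test; extend as needed.
CANDIDATES = ["latin-1", "cp1252", "cp850", "mac_roman"]


def encodings_for(byte_value: int, expected_name: str, candidates=CANDIDATES):
    """Return the candidate encodings in which byte_value decodes to the
    character whose Unicode name is expected_name."""
    matches = []
    for enc in candidates:
        try:
            ch = bytes([byte_value]).decode(enc)
        except UnicodeDecodeError:
            continue  # this byte is undefined in that encoding
        # Control characters have no name, so fall back to "".
        if unicodedata.name(ch, "") == expected_name:
            matches.append(enc)
    return matches


# Byte value as read off the hex dump (0xE7 is a hypothetical example).
print(encodings_for(0xE7, "LATIN SMALL LETTER C WITH CEDILLA"))
```

Several encodings often agree on a single character, so it helps to
repeat the check with a second or third accented character from the
same file and keep only the encodings that match every time.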

        Dave G.

