ram(_at_)zedat(_dot_)fu-berlin(_dot_)de said:
> As far as I know, the data base engine stores text using UTF-8.
...
It would be worthwhile to use some other mode of access to confirm this.
It's possible that non-UTF-8 text is being stored in the tables in
some way that you don't expect or don't directly control. If some
person or process is inserting non-UTF-8 data into the database, it's
very unlikely that the database engine itself is altering the data
(e.g. converting it to UTF-8) -- engines generally don't do that on
their own. To say that it "stores text using UTF-8" would then simply
mean that it handles character data types in an "8-bit-clean" manner:
it won't mangle or alter characters that happen to have the high bit
set, and when queried, it will return exactly what was inserted.
> Now it seems as if the texts I get from DBI were encoded
> with ISO-8859-1. Could it be possible that DBI is converting
> the UTF-8 obtained from the data base to ISO-8859-1?
> Possibly it considers ISO-8859-1 to be the "default client
> charset"? ...
I'm not personally familiar with the DBI source code, but I believe DBI
itself should not convert or alter data content (barring a bug in the
driver for a given database engine). Data going to or from a database
is supposed to pass through DBI without modification of any sort.
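Rather than judging the encoding from how the text displays, it can help
to look at the raw bytes that come back. The following is only a minimal
sketch in Python, not anything taken from DBI or the ADO driver; the
helper name describe_bytes is made up for illustration. It prints a hex
dump and reports whether the bytes form valid UTF-8. Keep in mind that
pure ASCII text is valid in both encodings, so the check only tells you
something for strings with non-ASCII characters.

```python
def describe_bytes(raw: bytes) -> str:
    """Return a hex dump plus a verdict on whether the bytes are valid UTF-8."""
    hex_dump = " ".join(f"{b:02x}" for b in raw)
    try:
        raw.decode("utf-8")
        verdict = "valid UTF-8"
    except UnicodeDecodeError:
        # A lone high-bit byte such as 0xFC cannot be UTF-8; ISO-8859-1 is
        # one plausible explanation, but this is a guess, not a detection.
        verdict = "not valid UTF-8 (possibly ISO-8859-1)"
    return f"{hex_dump} | {verdict}"


# The same character, U+00FC (u with diaeresis), in the two encodings
# under discussion.
as_utf8 = "\u00fc".encode("utf-8")         # two bytes: c3 bc
as_latin1 = "\u00fc".encode("iso-8859-1")  # one byte:  fc

print(describe_bytes(as_utf8))
print(describe_bytes(as_latin1))
```

If the bytes you fetch look like the second case, the text really is
arriving as single-byte data; if they look like the first case but
display as two odd characters, the problem is more likely in how the
client displays or decodes them.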
> How can I get the utf-8 text stored in the data base?
If you have a UTF-8-encoded string and put it into a table via an
insert or update operation, that specific byte sequence should be
retrievable from the table later on, via a normal query.
If you are specifically inserting a UTF-8 character string and are
then getting back something different when you query for that string,
you should contact the author of DBD::ADO, the driver module behind
dbi:ADO. Again, it will help to use other methods of access to the
database so that you can get a better idea of where the data
corruption is happening.
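The round-trip test described above can be sketched as follows. This is
an illustration under stated assumptions only: it uses Python's built-in
sqlite3 module with an in-memory database as a stand-in for "some other
method of access", and the table name t, the column name body, and the
function round_trip are all hypothetical. The idea is the same for any
access path: insert a known byte sequence, fetch it back, and compare
byte for byte.

```python
import sqlite3


def round_trip(payload: bytes) -> bytes:
    """Insert payload into a scratch table and return what a query yields."""
    conn = sqlite3.connect(":memory:")  # stand-in database, not the real one
    try:
        conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, body BLOB)")
        conn.execute("INSERT INTO t (body) VALUES (?)", (payload,))
        (stored,) = conn.execute("SELECT body FROM t").fetchone()
        return bytes(stored)
    finally:
        conn.close()


# A known UTF-8 test string containing two non-ASCII characters.
original = "Gr\u00fc\u00dfe".encode("utf-8")
returned = round_trip(original)

print("sent:    ", original.hex())
print("received:", returned.hex())
print("identical" if returned == original else "ALTERED somewhere in transit")
```

Running the same comparison through each access path in turn (the
suspect driver, then an alternative client) should show which layer
changes the bytes.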
Dave Graff