perl-unicode

Re: DBI and UTF-8

2003-12-06 11:30:04
On Sat, Dec 06, 2003 at 10:30:40AM -0500, David Graff wrote:
  Now it seems as if the texts I get from DBI were encoded
  with ISO-8859-1. Could it be possible that DBI is converting
  the UTF-8 obtained from the data base to ISO-8859-1?
  Possibly it considers ISO-8859-1 to be the "default client
  charset"?  ...

I'm not personally familiar with the DBI source code, but I believe any
sort of conversion or alteration of data content by DBI should be quite
impossible (unless there is a bug in the driver for a given RDB engine).

This problem is not DBI-specific.

There is a fundamental asymmetry in Perl's Unicode implementation:

    - Implicit conversion from bytestrings to ustrings assumes that
      the bytestrings are in Latin1.
    - Implicit conversion from ustrings to bytestrings (e.g. printing
      wide characters to a handle with no encoding layer) emits the
      ustrings as UTF-8.
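
A minimal sketch of the first point, assuming a bytestring that holds the
UTF-8 encoding of U+00E9 (as an undecoded database value might). The variable
names are illustrative only. Concatenating it with a ustring promotes each
byte as a separate Latin1 character:

```perl
use strict;
use warnings;

# UTF-8 bytes for U+00E9, held in an ordinary bytestring (assumed input).
my $bytes = "\xC3\xA9";

# A ustring containing a code point above 0xFF.
my $chars = "\x{263A}";

# Concatenation implicitly promotes $bytes, reading each byte as Latin1.
my $mixed = $bytes . $chars;

# Collect the resulting code points for inspection.
my @codepoints = map { sprintf "U+%04X", ord($_) } split //, $mixed;

print scalar(@codepoints), " characters: @codepoints\n";
# prints "3 characters: U+00C3 U+00A9 U+263A"
```

The two bytes come out as two characters (U+00C3 and U+00A9) rather than the
single character U+00E9, which is the "looks like ISO-8859-1" symptom.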

This happens because:

    - The first 256 codepoints in Unicode happen to agree with Latin1.
    - Simon Cozens, the person who implemented b=>u conversion for Perl
      on ASCII and EBCDIC platforms, happened to think that Latin1 is
      synonymous with ASCII.

Hence, you need to explicitly decode the bytestrings returned by
DBI into ustrings, using either utf8::decode or Encode::decode_utf8.
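
A rough sketch of both options. The @row array is a hypothetical stand-in for
what a DBI fetch call might return, on the assumption that the driver hands
back raw UTF-8 bytes; no real database is involved here:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for a fetched row: raw UTF-8 bytes (assumption).
my @row = ("caf\xC3\xA9", "na\xC3\xAFve");

# Option 1: Encode::decode_utf8 returns a new, decoded ustring.
my @decoded = map { Encode::decode_utf8($_) } @row;

# Option 2: utf8::decode works in place on a copy and returns false
# if the bytes are not valid UTF-8.
my $name = $row[0];
utf8::decode($name) or warn "value is not valid UTF-8\n";

# The first value is 5 bytes before decoding and 4 characters after.
printf "%d bytes, %d characters\n", length($row[0]), length($decoded[0]);
```

Decoding at the point where data leaves the database layer keeps the rest of
the program working with ustrings only, so no implicit promotion is needed.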

It's inconvenient, but it's the way it is. :-/

Maybe a way to globally force UTF8 (or some other encoding) to be used
on b=>u promotion is a good idea, but AFAIK it does not yet exist.

Thanks,
/Autrijus/
