Re: perl, unicode and databases (mysql)


----- Original Message -----
From: "Tim Bunce" <Tim(_dot_)Bunce(_at_)pobox(_dot_)com>
To: "Merijn van den Kroonenberg" <merijnspam(_at_)e-factory(_dot_)nl>
Subject: Re: perl, unicode and databases (mysql)

On Tue, Aug 20, 2002 at 04:50:18PM +0200, Merijn van den Kroonenberg wrote:

Thank you for the answer,

I did some experimenting with the Devel::Peek module and I found the
cause of my problem.
I was using the DBI $DBHANDLE->quote($astring); method to quote (and
slash) strings that I put in the database. Unfortunately this method is
not Unicode safe, and my data got corrupted. It looks like the data gets
UTF-8 encoded twice. I wrote a temporary function to slash my data, but
I would rather use the DBI method if possible. I have the feeling that
this problem can be solved in some way; maybe someone can explain what
is most likely causing the problem, and whether I can do something to
make it Unicode safe (without having to modify the DBI module). If it's
not possible, let me know too; then I will just keep the temp function I
use now ;-)


In general the quote() method should be as aware of utf8 as the
database is.  If the database supports utf8 then the quote() method
should do-the-right-thing or else it's broken and needs fixing.


Well, when I quote it manually:

############################################################
# utf8_quote(string)
sub utf8_quote($){
  my $astring = shift;
  $astring =~ s/(['"\\\0])/\\$1/g;
  return "'".$astring."'";
}# utf8_quote
############################################################

Then I can store and retrieve it just fine. So I guess it supports utf8 ;-)
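
A minimal sketch of why a function like this can be Unicode safe,
assuming the input is already a Perl character string: the substitution
works on characters, so non-ASCII characters pass through untouched and
the utf8 flag survives. The sample value is made up for illustration,
and this says nothing about why a particular driver's quote() fails.

```perl
use strict;
use warnings;

# Same escaping rule as the utf8_quote function above: backslash-escape
# single quote, double quote, backslash and NUL, then wrap in quotes.
sub utf8_quote {
    my ($string) = @_;
    $string =~ s/(['"\\\0])/\\$1/g;
    return "'" . $string . "'";
}

# Hypothetical sample value: two CJK characters plus an apostrophe.
my $text   = "\x{99f1}\x{99dd}'s";
my $quoted = utf8_quote($text);

# The result is still a character string; only the apostrophe changed.
printf "flag on: %s, length: %d\n",
    (utf8::is_utf8($quoted) ? "yes" : "no"), length $quoted;
```

The length is counted in characters here, so the two CJK characters
count as one each regardless of how many octets they occupy.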

Oh yeah, one other thing: since Encode::_utf8_on is an internal function,
wouldn't it be better to use Encode::decode("utf8",$somevar) instead? As
far as I can see, it should do exactly the same, but if I am mistaken,
let me know :)


Encode::_utf8_on *just* sets the internal utf8 flag bit on the value
which *must* be already valid utf8 (or else you'll get problems later).

I believe Encode::decode is different (but I've never used either and
could easily not know what I'm talking about :)


from perldoc Encode
 CAVEAT: When you run "$string = decode("utf8",
         $octets)", then $string may not be equal to $octets.
         Though they both contain the same data, the utf8 flag
         for $string is on unless $octets entirely consists of
         ASCII data (or EBCDIC on EBCDIC machines).  See "The
         UTF-8 flag" below.

That's why I got that idea and wondered, because it also seems to set
the utf8 flag and leave the data alone. Not sure though.
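
A small sketch comparing the two calls, using the same octets that
appear in the Devel::Peek example further down. For well-formed input
the results compare equal; the difference shows up with malformed
input, where decode() (with its default CHECK value) substitutes U+FFFD
and _utf8_on checks nothing. The "abc\xFF" value is a made-up example
of malformed data.

```perl
use strict;
use warnings;
use Encode ();

# UTF-8 octets for U+99F1 U+99DD, as a byte string (flag off).
my $octets = "\351\247\261\351\247\235";

# decode() leaves $octets alone and returns a new character string.
my $decoded = Encode::decode("utf8", $octets);

# _utf8_on() modifies its argument in place and performs no validation.
my $flagged = $octets;
Encode::_utf8_on($flagged);

# For well-formed input both routes give the same two characters.
print "same result\n" if $decoded eq $flagged;
printf "octets: %d bytes, decoded: %d characters\n",
    length $octets, length $decoded;

# For malformed input decode() substitutes U+FFFD by default, whereas
# _utf8_on() would simply mark the malformed data as utf8.
my $bad = Encode::decode("utf8", "abc\xFF");
printf "replacement character present: %s\n",
    ($bad =~ /\x{FFFD}/ ? "yes" : "no");
```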


Tim.


Thank you for the swift reply,

Merijn van den Kroonenberg

Thank you,
Merijn van den Kroonenberg


----- Original Message -----
From: "SADAHIRO Tomoyuki" <bqw10602(_at_)nifty(_dot_)com>
To: "Merijn van den Kroonenberg" <merijn(_at_)e-factory(_dot_)nl>
Cc: <perl-unicode(_at_)perl(_dot_)org>
Sent: Thursday, August 15, 2002 3:12 PM
Subject: Re: perl, unicode and databases (mysql)


On Tue, 13 Aug 2002 14:09:37 +0200
"Merijn van den Kroonenberg" <merijn(_at_)e-factory(_dot_)nl> wrote:

Hi all,

I have a Perl application (perl 5.8.0) which puts utf8 data in a MySQL
database. This seems to work pretty well, and retrieving the data with
Perl also works, using something like this:

my $sth = $db_handle->prepare("SELECT some query");
$sth->execute;
my @row = $sth->fetchrow_array;
print $row[0]."\n"; #### print before
if ($] > 5.007) {
  require Encode;
  Encode::_utf8_on($row[0]);
}
print $row[0]."\n"; #### print after
$sth->finish;

Encode::_utf8_on gives me back good data. As far as I understood, the
_utf8_on function doesn't do any real conversion, but only switches the
utf8 flag of a Perl string?

If you compare the two prints in the above example, then it seems that
after the utf8 flag is set the string is UTF-8 decoded. This results in
the correct string, so it seems the original string is UTF-8 encoded
(double encoded, since it already was UTF-8).

When I select the same string manually (mysql prompt) or with PHP, then
I get back the double encoded string. So it seems to me that the double
encoded format is how Perl stores it internally (and also in the
database)?

But this doesn't sound right to me... this would mean that every time a
utf8 flagged string is used it would need to be UTF-8 decoded. That
doesn't sound very efficient to me, so I doubt it's done that way. But
meanwhile I have no idea how it's done... and how it's stored in the
database.

As you might have guessed, I want to access the data I put in the
database with PHP, but I get back double UTF-8 encoded data there. The
problem could be in a lot of different places, for example my fetching
in PHP, storing in Perl, and maybe somewhere else where I have some
faulty conversion. To check if the data in the database is correct I
tried to figure out how Perl works with the data.

Maybe someone could put me on the right track, because this got me
mighty confused ;-)
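
For reference, a sketch of what "double encoded" means at the octet
level. It only illustrates one way such data can arise (encoding octets
that were already UTF-8) and is not a diagnosis of any particular DBI or
MySQL setup. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

# One character, U+99F1, as a Perl character string.
my $char = "\x{99f1}";

# Encoded once: 3 octets (E9 A7 B1).
my $once = Encode::encode("utf8", $char);

# Double encoding: the 3 octets are mistaken for 3 Latin-1 characters
# and encoded again, so each octet >= 0x80 grows to two octets.
my $twice = Encode::encode("utf8", $once);

printf "once:  %s\n", join " ", map { sprintf "%02X", ord } split //, $once;
printf "twice: %s\n", join " ", map { sprintf "%02X", ord } split //, $twice;

# Undoing it takes two decode steps with a Latin-1 round trip between.
my $step1  = Encode::decode("utf8", $twice);        # U+00E9 U+00A7 U+00B1
my $octets = Encode::encode("iso-8859-1", $step1);  # back to E9 A7 B1
my $back   = Encode::decode("utf8", $octets);       # U+99F1 again
```

A client that reads the six octets and decodes them once sees three
Latin-1 range characters instead of the one intended character.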


To see what a Perl scalar holds,
use Devel::Peek.

#!perl
use Devel::Peek;
use Encode;

our $camel_utf8 = "\351\247\261\351\247\235";

print STDERR "* _utf8_on\n\n";
Encode::_utf8_on($camel_utf8);
Dump($camel_utf8);

print STDERR "\n";

print STDERR "* _utf8_off\n\n";
Encode::_utf8_off($camel_utf8);
Dump($camel_utf8);

__END__

The output is like this.
The difference between _on and _off is found in FLAGS.

* _utf8_on

SV = PV(0x1661c60) at 0x166cccc
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x16db4e0 "\351\247\261\351\247\235"\0 [UTF8 "\x{99f1}\x{99dd}"]

  CUR = 6
  LEN = 7

* _utf8_off

SV = PV(0x1661c60) at 0x166cccc
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x16db4e0 "\351\247\261\351\247\235"\0
  CUR = 6
  LEN = 7
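
The same observation can be sketched without Devel::Peek, assuming the
core utf8::is_utf8 function is available: the buffer stays at six
octets in both cases (CUR = 6 above), and only the flag decides whether
length() counts characters or octets.

```perl
use strict;
use warnings;
use Encode ();

my $camel = "\351\247\261\351\247\235";   # same six octets as above

Encode::_utf8_on($camel);
my $flag_on = utf8::is_utf8($camel) ? 1 : 0;
my $chars   = length $camel;    # counted in characters

Encode::_utf8_off($camel);
my $flag_off = utf8::is_utf8($camel) ? 1 : 0;
my $bytes    = length $camel;   # counted in octets

print "flag on:  is_utf8=$flag_on length=$chars\n";
print "flag off: is_utf8=$flag_off length=$bytes\n";
```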



SADAHIRO Tomoyuki