perl-unicode

Re: perl, unicode and databases (mysql)

2002-08-21 13:22:09

----- Original Message -----
From: "Tim Bunce" <Tim(_dot_)Bunce(_at_)pobox(_dot_)com>
To: "Merijn van den Kroonenberg" <merijnspam(_at_)e-factory(_dot_)nl>
Subject: Re: perl, unicode and databases (mysql)


On Tue, Aug 20, 2002 at 04:50:18PM +0200, Merijn van den Kroonenberg wrote:
Thank you for the answer,

I did some experimenting with the Devel::Peek module and I found the
cause of my problem. I was using the DBI $DBHANDLE->quote($astring)
method to quote (and backslash-escape) strings that I put in the
database. Unfortunately this method is not Unicode safe, and my data got
corrupted. It looks like the data gets UTF-8 encoded twice. I wrote a
temporary function to escape my data, but I would rather use the DBI
method if possible. I have the feeling that this problem can be solved
in some way. Maybe someone can explain what is most likely causing the
problem, and whether I can do something to make it Unicode safe (without
having to modify the DBI module). If it's not possible, let me know too;
then I'll just keep the temporary function I use now ;-)

In general the quote() method should be as aware of utf8 as the
database is.  If the database supports utf8 then the quote() method
should do-the-right-thing or else it's broken and needs fixing.

Well, when I quote it manually:

############################################################
# utf8_quote(string)
sub utf8_quote($){
  my $astring = shift;
  $astring =~ s/(['"\\\0])/\\$1/g;
  return "'".$astring."'";
}# utf8_quote
############################################################

Then I can store and retrieve it just fine. So I guess it supports utf8 ;-)


Oh yeah, one other thing: since Encode::_utf8_on is an internal
function, wouldn't it be better to use Encode::decode("utf8",$somevar)
instead? As far as I can see, it should do exactly the same, but if I am
mistaken, let me know :)

Encode::_utf8_on *just* sets the internal utf8 flag bit on the value
which *must* be already valid utf8 (or else you'll get problems later).

I believe Encode::decode is different (but I've never used either and
could easily not know what I'm talking about :)

from perldoc Encode
 CAVEAT: When you run "$string = decode("utf8",
         $octets)", then $string may not be equal to $octets.
         Though they both contain the same data, the utf8 flag
         for $string is on unless $octets entirely consists of
         ASCII data (or EBCDIC on EBCDIC machines).  See "The
         UTF-8 flag" below.

That's why I got that idea, so I wondered, because it also seems to set
the utf8 flag and leave the data alone. Not sure though.
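
The question above (does Encode::decode("utf8", ...) end up with the same
string as Encode::_utf8_on?) can be checked with a small script. This is
only a sketch, assuming a perl with the Encode module (5.8.0 or later)
and well-formed UTF-8 input; the variable names are made up for the
illustration. For malformed input the two are not expected to agree,
since _utf8_on performs no validation at all.

```perl
use strict;
use warnings;
use Encode ();

# Six octets: the UTF-8 encoding of U+99F1 U+99DD (the bytes used elsewhere in this thread).
my $octets = "\351\247\261\351\247\235";

# decode() returns a new character string; without a CHECK argument
# the source octets are left untouched.
my $decoded = Encode::decode("utf8", $octets);

# _utf8_on() only flips the flag, in place, so work on a copy.
my $flagged = $octets;
Encode::_utf8_on($flagged);

printf "decoded: length=%d flag=%d\n", length($decoded), Encode::is_utf8($decoded) ? 1 : 0;
printf "flagged: length=%d flag=%d\n", length($flagged), Encode::is_utf8($flagged) ? 1 : 0;
printf "equal:   %s\n", ($decoded eq $flagged) ? "yes" : "no";
printf "octets:  length=%d flag=%d\n", length($octets), Encode::is_utf8($octets) ? 1 : 0;
```

For valid input both routes give a two-character string with the flag
on, while the original octet string keeps its length of 6.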



Tim.

Thank you for the swift reply,

Merijn van den Kroonenberg


Thank you,
Merijn van den Kroonenberg


----- Original Message -----
From: "SADAHIRO Tomoyuki" <bqw10602(_at_)nifty(_dot_)com>
To: "Merijn van den Kroonenberg" <merijn(_at_)e-factory(_dot_)nl>
Cc: <perl-unicode(_at_)perl(_dot_)org>
Sent: Thursday, August 15, 2002 3:12 PM
Subject: Re: perl, unicode and databases (mysql)



On Tue, 13 Aug 2002 14:09:37 +0200
"Merijn van den Kroonenberg" <merijn(_at_)e-factory(_dot_)nl> wrote:

Hi all,

I have a perl application (perl 5.8.0) which puts utf8 data in a mysql
database. This seems to work pretty well, and retrieving the data with
perl also works, using something like this:

my $sth = $db_handle->prepare("SELECT some query");
$sth->execute;
my @row=$sth->fetchrow_array;
print $row[0]."\n"; #### print before
if ($] > 5.007) {
  require Encode;
  Encode::_utf8_on($row[0]);
}
print $row[0]."\n"; #### print after
$sth->finish;

Encode::_utf8_on gives me back good data. As far as I understood, the
_utf8_on function doesn't do any real conversion, but only switches on
the utf8 flag of a perl string?

If you compare the two prints in the above example, it seems that after
the utf8 flag is set the string is utf8 decoded. This results in the
correct string, so it seems the original string is utf8 encoded (double
encoded, since it already was UTF-8).

When I select the same string manually (mysql prompt) or with PHP, I
get back the double encoded string. So it seems to me that the double
encoded format is how perl stores it internally (and also in the
database)? But this doesn't sound right to me... this would mean that
every time a utf8 flagged string is used it would need to be utf8
decoded. That doesn't sound very efficient to me, so I doubt it's done
that way. But meanwhile I have no idea how it's done... and how it's
stored in the database.

As you might have guessed, I want to access the data I put in the
database with PHP, but I get back double utf8 encoded data there. The
problem could be in a lot of different places: for example my fetching
in PHP, my storing in perl, or somewhere else where I have a faulty
conversion. To check whether the data in the database is correct, I
tried to figure out how perl works with the data.

Maybe someone could put me on the right track, because this got me
mighty confused ;-)

To see what a Perl scalar holds, use Devel::Peek (Devel/Peek.pm).

#!perl
use Devel::Peek;
use Encode;

our $camel_utf8 = "\351\247\261\351\247\235";

print STDERR "* _utf8_on\n\n";
Encode::_utf8_on($camel_utf8);
Dump($camel_utf8);

print STDERR "\n";

print STDERR "* _utf8_off\n\n";
Encode::_utf8_off($camel_utf8);
Dump($camel_utf8);

__END__

The output is like this.
The difference between _on and _off is found in FLAGS.

* _utf8_on

SV = PV(0x1661c60) at 0x166cccc
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x16db4e0 "\351\247\261\351\247\235"\0 [UTF8 "\x{99f1}\x{99dd}"]
  CUR = 6
  LEN = 7

* _utf8_off

SV = PV(0x1661c60) at 0x166cccc
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x16db4e0 "\351\247\261\351\247\235"\0
  CUR = 6
  LEN = 7
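
The "double encoded" symptom described earlier in the thread can be
reproduced without a database. The sketch below is an illustration
only, not a diagnosis of what DBI or MySQL actually did in this case: it
assumes the corruption comes from encoding already-encoded octets a
second time, and the variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode ();

# Two characters, U+99F1 U+99DD, as a character string.
my $chars = "\x{99f1}\x{99dd}";

# Correct UTF-8: three octets per character.
my $once = Encode::encode("utf8", $chars);

# Encoding the octets again treats each byte as a Latin-1 character;
# every byte here is >= 0x80, so each one becomes two octets.
my $twice = Encode::encode("utf8", $once);

printf "characters:    %d\n", length($chars);
printf "encoded once:  %d octets\n", length($once);
printf "encoded twice: %d octets\n", length($twice);

# A single decode of the double-encoded data yields the correct octets
# again (as characters 0x80-0xFF), not the original characters.
my $undone = Encode::decode("utf8", $twice);
print "one decode gives back the single-encoded octets\n" if $undone eq $once;
```

Comparing octet counts like this (6 versus 12 for the string above) is
one way to tell whether the data in the database is single or double
encoded.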



SADAHIRO Tomoyuki




