perl-unicode
Re: perl, unicode and databases (mysql)
2002-08-21 13:22:09

----- Original Message -----
From: "Tim Bunce" <Tim(_dot_)Bunce(_at_)pobox(_dot_)com>
To: "Merijn van den Kroonenberg" <merijnspam(_at_)e-factory(_dot_)nl>
Subject: Re: perl, unicode and databases (mysql)

On Tue, Aug 20, 2002 at 04:50:18PM +0200, Merijn van den Kroonenberg wrote:

Thank you for the answer,

I did some experimenting with the Devel::Peek module and i found the cause of my problem. I was using the DBI $DBHANDLE->quote($astring); method to quote (and slash) strings that i put in the database. Unfortunately this method is not unicode safe, and my data got corrupted. It looks like the data gets utf encoded twice.

I wrote a temporary function to slash my data, but i would rather use the DBI method if possible. I have the feeling that this problem can be solved in some way, maybe someone can explain what is most likely causing the problem, and if i can do something to make it unicode safe (without having to modify the DBI module). If it's not possible let me know too, then i just keep the temp function i use now ;-)

In general the quote() method should be as aware of utf8 as the database is. If the database supports utf8 then the quote() method should do-the-right-thing or else it's broken and needs fixing.

Well, when i quote it manually:

    ############################################################
    # utf8_quote(string)
    sub utf8_quote($){
        my $astring = shift;
        $astring =~ s/(['"\\\0])/\\$1/g;
        return "'".$astring."'";
    }# utf8_quote
    ############################################################

Then i can store and retrieve it just fine. So i guess it supports utf8 ;-)

Oh yeah, one other thing, since Encode::_utf8_on is an internal function, wouldn't it be better to use Encode::decode("utf8",$somevar) instead?
As far as i can see, it should do exactly the same, but if i am mistaken, let me know :)

Encode::_utf8_on *just* sets the internal utf8 flag bit on the value which *must* be already valid utf8 (or else you'll get problems later). I believe Encode::decode is different (but I've never used either and could easily not know what I'm talking about :)

from perldoc Encode

    CAVEAT: When you run "$string = decode("utf8", $octets)", then
    $string may not be equal to $octets. Though they both contain the
    same data, the utf8 flag for $string is on unless $octets entirely
    consists of ASCII data (or EBCDIC on EBCDIC machines). See "The
    UTF-8 flag" below.

That's why i got that idea, so i wondered, because it also seems to set the utf8 flag, and leave the data alone. Not sure tho.

Tim.

Thank you for the swift reply,
Merijn van den Kroonenberg

Thank you,
Merijn van den Kroonenberg

----- Original Message -----
From: "SADAHIRO Tomoyuki" <bqw10602(_at_)nifty(_dot_)com>
To: "Merijn van den Kroonenberg" <merijn(_at_)e-factory(_dot_)nl>
Cc: <perl-unicode(_at_)perl(_dot_)org>
Sent: Thursday, August 15, 2002 3:12 PM
Subject: Re: perl, unicode and databases (mysql)

On Tue, 13 Aug 2002 14:09:37 +0200 "Merijn van den Kroonenberg" <merijn(_at_)e-factory(_dot_)nl> wrote:

Hi all,

I have a perl application (perl 5.8.0) which puts utf8 data in a mysql database. This seems to work pretty well, and the retrieving of the data with perl also works. Using something like this:

    my $sth = $db_handle->prepare("SELECT some query");
    $sth->execute;
    my @row=$sth->fetchrow_array;
    print $row[0]."\n";    #### print before
    if ($]>5.007){
        require Encode;
        Encode::_utf8_on($row[0]);
    }
    print $row[0]."\n";    #### print after
    $sth->finish;

The Encode utf8_on gives me back good data. As far as i understood the _utf8_on method doesn't do any real conversions, but only switches the utf flag of a perl string?

If you compare the two prints in above example, then it seems that after the utf flag is set the string is utf decoded.
This results in the correct string, so it seems the original string is utf encoded (double encoded, since it already was UTF).

When i select the same string manually (mysql prompt) or with PHP, then i get back the double encoded string. So it seems to me that the double encoded format is how perl stores it internally (and also in the database)?

But this doesn't sound right to me... this would mean that every time a utf flagged string is used it would need to be utf decoded. That doesn't sound very efficient to me, so i doubt it's done that way. But meanwhile i have no idea how it's done... and how it's stored in the database.

As you might have guessed i want to access the data i put in the database with PHP, but i get back double utf encoded data there. The problem could be in a lot of different places, for example my fetching in PHP, storing in perl and maybe somewhere else where i have some faulty conversion. To check if the data in the database is correct i tried to figure out how perl works with the data.

Maybe someone could put me on the right track, because this got me mighty confused ;-)

To look at what Perl's scalar holds, use Devel/Peek.pm.

    #!perl
    use Devel::Peek;
    use Encode;

    our $camel_utf8 = "\351\247\261\351\247\235";

    print STDERR "* _utf8_on\n\n";
    Encode::_utf8_on($camel_utf8);
    Dump($camel_utf8);

    print STDERR "\n";

    print STDERR "* _utf8_off\n\n";
    Encode::_utf8_off($camel_utf8);
    Dump($camel_utf8);
    __END__

The output is like this. The difference between _on and _off is found in FLAGS.

    * _utf8_on

    SV = PV(0x1661c60) at 0x166cccc
      REFCNT = 1
      FLAGS = (POK,pPOK,UTF8)
      PV = 0x16db4e0 "\351\247\261\351\247\235"\0 [UTF8 "\x{99f1}\x{99dd}"]
      CUR = 6
      LEN = 7

    * _utf8_off

    SV = PV(0x1661c60) at 0x166cccc
      REFCNT = 1
      FLAGS = (POK,pPOK)
      PV = 0x16db4e0 "\351\247\261\351\247\235"\0
      CUR = 6
      LEN = 7

SADAHIRO Tomoyuki
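
For anyone following the thread, here is a small sketch of the two questions discussed above: how Encode::decode compares with Encode::_utf8_on, and what "encoded twice" looks like at the octet level. It is only an illustration, assuming perl 5.8 or later with the core Encode module. It reuses the six octets from the Devel::Peek example, and the variable names ($octets, $decoded, $flagged, $double, $repaired) are made up for this sketch. It does not touch DBI or mysql, so it says nothing about what quote() actually does.

```perl
#!perl
use strict;
use warnings;
use Encode ();

# The six octets from the Devel::Peek example: UTF-8 for U+99F1 U+99DD.
my $octets = "\351\247\261\351\247\235";

# decode() checks the octets and returns a new character string;
# $octets itself is left alone (no CHECK argument is passed).
my $decoded = Encode::decode('UTF-8', $octets);

# _utf8_on() does no conversion, it only sets the internal flag,
# so it is applied to a copy here to keep $octets as plain octets.
my $flagged = $octets;
Encode::_utf8_on($flagged);

printf "raw: %d octets, decoded: %d chars, flagged: %d chars\n",
    length($octets), length($decoded), length($flagged);
print "decoded has the utf8 flag\n" if utf8::is_utf8($decoded);

# "Encoded twice": octets without the flag are treated as Latin-1
# characters, so encoding them again turns each high octet into two.
my $double = Encode::encode('UTF-8', $octets);
printf "double encoded: %d octets\n", length($double);

# Decoding once peels off the extra layer and gives the octets back.
my $repaired = Encode::decode('UTF-8', $double);
print "one decode restores the original octets\n" if $repaired eq $octets;
```

On valid input both routes end up with the same two characters. The difference is that decode() looks at the data while _utf8_on() trusts it blindly, which matches the warning above that the value must already be valid utf8. The second half shows why the doubled data is larger: six octets become twelve.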