RE: UTF-16 -> UTF-8


 
I don't know whether this helps or is even related, but I found that when
processing UTF-8 data in an Oracle database, Perl *seemed* to corrupt the
data beyond recognition until I built the script as a freestanding executable
using the Perl Dev Kit (PDK) from ActiveState. After that it all worked fine.
Having already obtained a license for the PDK, I thought nothing more of it
and just made a note that the build step is needed. Might something similar
resolve your problem?
By 'beyond recognition' I mean this: the script was asked to store two
particular bytes, which I expected to represent a particular glyph, but it
actually stored two entirely different bytes, which the application displayed
as punctuation. I had been careful at all stages to ensure that the
environment and the database were set to use UTF-8, and I changed nothing in
the environment or the script to get it working, apart from building the
executable.
Maybe it's a clue. Maybe it's a red herring. The PDK is free to try for a
week ...
[ you may need to 'require DBD::ODBC;' to get it to build entirely
freestanding ]
Regards,
Tim
Rui Ribeiro <ruirib(_at_)computer(_dot_)org> wrote:

Philip,

I think the problem still lies with Perl, though not with Unicode::String. My
guess is this:

When the Unicode value is added to the SQL string in
$sql="INSERT INTO Tipo_Referencia ( Descricao ) VALUES ('$palavra_utf16');";
there is an implicit conversion from the Unicode::String object to an ordinary
Perl string value. An ordinary Perl string doesn't "understand" Unicode, so it
treats each multibyte character as several single-byte characters and writes
them to Access that way.
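
A minimal sketch of that effect, assuming a Perl with the core Encode module
(5.8 or later, which is newer than the setup discussed in this thread) instead
of Unicode::String. The sample word is made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample value: "a", e-acute (U+00E9), "o".
my $palavra = "a\x{E9}o";

# Encode to UTF-16LE, the byte order Windows applications usually expect.
my $utf16 = encode('UTF-16LE', $palavra);

# Interpolating the encoded value into SQL simply copies its bytes;
# nothing in the resulting string records that they are UTF-16.
my $sql = "INSERT INTO Tipo_Referencia ( Descricao ) VALUES ('$utf16');";

# Three characters have become six separate bytes.
my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $utf16;
print length($palavra), " characters, ", length($utf16), " bytes: $hex\n";
# prints "3 characters, 6 bytes: 61 00 E9 00 6F 00"
```

Anything downstream that reads $sql one byte at a time sees six single-byte
characters, three of them NUL, not three UTF-16 characters.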

I've tried another method of writing to the database, but there is also an
implicit conversion in this instruction:

$rs->{"Descricao"} = $palavra_utf16;

$rs is the dynamic recordset to which I'll add a new record, and "Descricao"
is the name of the field to which I intended to add the Unicode value.

So I think (or rather, I guess) the problem may be that Perl doesn't have
native support for Unicode in UTF-16 format (and Access doesn't have it for
UTF-8!). So with the functions and methods available for writing to an Access
database from Perl, the value will always be converted to something other than
the UTF-16 that Access recognizes before it is actually written.

I guess I'll have to handle my special characters outside Perl. It's less
elegant, but probably easier to solve.


Once again your insights have been very instructive. Thank you so much for
your help.
Best regards.

Rui

-----Original Message-----
From: Philip Newton [mailto:Philip(_dot_)Newton(_at_)gmx(_dot_)net]
Sent: Wednesday, 21 November 2001 18:29
To: Rui Ribeiro
Cc: perl-unicode(_at_)perl(_dot_)org
Subject: Re: UTF-16 -> UTF-8


On Wed, 21 Nov 2001 16:34:48 -0000, in perl.unicode you wrote:

> Don't lose more time over this. It seems there is some kind of problem with
> the recognition of the encoding from other Office apps.
> It's rather surprising that Notepad recognizes the characters properly and
> Word and Access don't.


Would it maybe help to add a BOM (byte order mark) at the beginning of
the file?
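
For illustration, a BOM could be written like this. It is only a minimal
sketch, assuming the core Encode module is available; the file name and sample
text are hypothetical, and whether Word or Access then detects the encoding is
not something this thread confirms:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical file name and sample text, for illustration only.
my $file = 'palavras_utf16.txt';
my $text = "a\x{E7}\x{E3}o\r\n";   # "acao" with cedilla and tilde, plus CRLF

# U+FEFF is the byte order mark; encoded as UTF-16LE it becomes FF FE.
my $bytes = encode('UTF-16LE', "\x{FEFF}" . $text);

open my $out, '>', $file or die "Cannot open $file: $!";
binmode $out;                      # write raw bytes, no newline translation
print {$out} $bytes;
close $out or die "Cannot close $file: $!";
```

The BOM is prepended as an ordinary character before encoding, so it is
written in the same byte order as the rest of the text.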

Anyway, I suppose you can now ask more questions on a Word or Access
list; the Perl part appears to work now, as far as I can see.

Cheers,
Philip



