perl-unicode

RE: UTF-16 -> UTF-8

2001-11-21 16:13:44
Yes Tim, I see your point. There is probably a relation to my problem. But
it seems a bit strange that it happened with UTF-8, because Perl 5.6
seems to treat UTF-8 chars properly.
My big problem is that I simply can't use UTF-8 because MS databases only
recognize UTF-16 or UCS-2 (I think they're the same, right?). They can't
handle UTF-8 at all. If they could, I think there wouldn't be a problem.
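
A rough sketch of how UTF-16 and UCS-2 relate, assuming a Perl newer than the 5.6 discussed here (5.8 or later, where the Encode module is part of the core distribution). The two agree for characters in the Basic Multilingual Plane and differ only above it, where UTF-16 uses a surrogate pair. The helper name hex_bytes is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);   # assumption: Perl 5.8+, Encode is not in 5.6

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return uc(join(' ', unpack('(H2)*', $bytes)));
}

# U+00E7 (c with cedilla) is inside the Basic Multilingual Plane:
# UTF-16 and UCS-2 both store it as a single 16-bit unit.
my $bmp = encode('UTF-16LE', "\x{E7}");

# U+1D11E (musical G clef) is outside the BMP: UTF-16 needs a
# surrogate pair (two 16-bit units), which UCS-2 does not have.
my $astral = encode('UTF-16LE', "\x{1D11E}");

print 'U+00E7  -> ', hex_bytes($bmp),    "\n";   # prints: U+00E7  -> E7 00
print 'U+1D11E -> ', hex_bytes($astral), "\n";   # prints: U+1D11E -> 34 D8 1E DD
```

So for accented Latin text such as Portuguese, the bytes are the same either way.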

It's just a pity that in Perl UTF-8 is the "native" format for Unicode
support, at least in the Windows environment, where UTF-16 is the
native format. That is how this type of problem arises.

But there are other inconsistencies even in the Windows "universe". If you
want to output Unicode database content to a browser, you need to use UTF-8
when using IIS. Fortunately, the conversion from UTF-16 to UTF-8 can be done
automatically using several methods from ASP objects (I know that because my
database content in Unicode will have to be viewed in a browser and I had to
test it). So it is a rather messy situation.
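
For comparison, the same UTF-16 to UTF-8 step can be sketched in Perl itself. This assumes Perl 5.8 or later with the core Encode module, which is not available in the 5.6 discussed in this thread, and the function name utf16_to_utf8 is invented for the example. It also shows where a byte order mark comes in: with the plain 'UTF-16' encoding name, Encode takes the byte order from a leading BOM.

```perl
use strict;
use warnings;
use Encode qw(decode encode);   # assumption: Perl 5.8+, not 5.6

# Hypothetical helper: convert a UTF-16 byte string to UTF-8 bytes.
# With the plain 'UTF-16' name, Encode reads the byte order from a
# leading BOM and dies if the input does not start with one.
sub utf16_to_utf8 {
    my ($utf16_bytes) = @_;
    my $chars = decode('UTF-16', $utf16_bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);               # characters -> UTF-8 bytes
}

# U+00E9 (e with acute) as UTF-16LE, preceded by the
# little-endian BOM (FF FE).
my $utf16 = "\xFF\xFE\xE9\x00";
my $utf8  = utf16_to_utf8($utf16);

print uc(join(' ', unpack('(H2)*', $utf8))), "\n";   # prints: C3 A9
```

When the data has no BOM, naming the byte order explicitly ('UTF-16LE' or 'UTF-16BE') avoids the guesswork.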

Thanks for your tip.

Regards,

Rui
  -----Original Message-----
  From: Tim Scott [mailto:gulbrain(_at_)yahoo(_dot_)co(_dot_)uk]
  Sent: Wednesday, 21 November 2001 22:37
  To: Rui Ribeiro; Philip Newton
  Cc: perl-unicode(_at_)perl(_dot_)org
  Subject: RE: UTF-16 -> UTF-8



  I don't know if this will help / is related or whatever, but I did find
that when processing UTF-8 data in an Oracle database Perl *seemed* to
corrupt the data beyond recognition, until I built it as a freestanding
executable using the Perl Dev Kit from ActiveState. It then all worked
fine.

  Having already obtained a license for the PDK I thought nothing more of
it, and just made a note that it needs doing. Might a similar thing resolve
your problem?

  By 'beyond recognition' I mean: the script was asked to store two particular
bytes which I expected to represent a particular glyph, but it actually
stored two entirely different bytes, which were shown as some
punctuation when displayed in the application. I had been careful at all
stages to ensure that the environment and the database were set to use
UTF-8, and had changed nothing in the environment or the script to get it
working, apart from building the executable.

  Maybe it's a clue. Maybe it's a red herring. PDK's free to try for a week
....
  [ you may need to 'require DBD::ODBC;' to get it to build entirely
freestanding ]

  Regards,
  Tim

    Rui Ribeiro <ruirib(_at_)computer(_dot_)org> wrote:

    Philip,

    I think the problem still lies with Perl, not with Unicode::String,
    though. My guess is this:

    When adding the Unicode value to the SQL string in
    $sql="INSERT INTO Tipo_Referencia ( Descricao ) VALUES ('$palavra_utf16');";
    there is an implicit conversion from the Unicode::String object to a
    common Perl string value. The common Perl string value doesn't
    "understand" Unicode, so it treats the multibyte char as several
    single-byte chars and writes them to Access that way.

    I've tried another method to write to the database. But there is also an
    implicit conversion in this instruction:

    $rs->{"Descricao"} = $palavra_utf16;

    $rs is the dynamic recordset to which I'll add a new record, and
    "Descricao" is the field name to which I intended to add the Unicode
    value.

    So I think (better to say, I guess) the problem may lie with the fact
    that Perl doesn't have native support for Unicode in UTF-16 format (and
    Access has none for UTF-8!). So using the functions / methods available
    to write to an Access database from Perl, there will always be a
    conversion to something other than the UTF-16 recognized by Access
    before the value is actually written.

    I guess I'll have to handle my special chars outside Perl. It's less
    elegant, but probably easier to solve.


    Once again your insights have been very instructive. Thank you so much
    for your help.
    Best regards.

    Rui

    > -----Original Message-----
    > From: Philip Newton [mailto:Philip(_dot_)Newton(_at_)gmx(_dot_)net]
    > Sent: Wednesday, 21 November 2001 18:29
    > To: Rui Ribeiro
    > Cc: perl-unicode(_at_)perl(_dot_)org
    > Subject: Re: UTF-16 -> UTF-8
    >
    >
    > On Wed, 21 Nov 2001 16:34:48 -0000, in perl.unicode you wrote:
    >
    > > Don't lose more time over this. It seems there is some kind of
    > > problem with the recognition of the encoding from other Office apps.
    > > It's rather surprising that Notepad recognizes the characters
    > > properly and Word and Access don't.
    >
    > Would it maybe help to add a BOM (byte order mark) at the beginning of
    > the file?
    >
    > Anyway, I suppose you can now ask more questions on a Word or Access
    > list; the Perl part appears to work now, as far as I can see.
    >
    > Cheers,
    > Philip
    >




