RE: UTF-16 -> UTF-8

Philip,

Thank you for your help.
This work is being done by a couple of students of mine, so I just sent you
one of the results of the experiments. But they have tried other things, so
I'll make some localized comments below.

On Tue, 20 Nov 2001 16:35:25 -0000, in perl.unicode you wrote:

open(FICH1,"fich1.txt")||die"Nao foi possivel abrir o ficheiro fich1.txt";
open(FICH3,">fich3.txt")||die"Nao foi possivel abrir o ficheiro fich3.txt";


Good that you check for success, but you should also include the reason
-- it's in $!. For example:

    open(FICH1, "fich1.txt") || die "Nao foi possivel abrir " .
                                    "o ficheiro fich1.txt: $!";

use utf8;


Yes, there is no need for it.


You shouldn't need that. Unicode::String will do all the Unicodery for
you; your program only needs to handle 'plain' bytes.

while (<FICH1>) {
    chomp($_);
    $palavra1=$_;
    @array=split(/ /,$palavra1);


What do you use $palavra1 and @array for? (And @array is usually a bad
variable name.)


Yes, quite true. I guess they left it from some experiment and I overlooked it.

    $palavra2=utf16($_);


Here is a mistake. If you call utf16($_), it means "$_ is a string
encoded in UTF-16. Take it and convert it into a Unicode::String
object."


We've tried with utf8. It does read well and it writes well as long as you 
write it in utf8.


But you said you wanted to convert from UTF-8 to UTF-16. So you probably
want something like

    $palavra_objeito = utf8($_);
    $palavra_em_utf16 = $palavra_objeito->utf16;


We've tried just that and the result wasn't what we expected...


Note that ->utf16 will return UTF-16BE, as I understand it, since
"Internally a Unicode::String object is a string of 2 byte values in
network byte order (big-endian)" (quote from the docs). So if your
database and/or file wants UTF-16LE (which is more natural for Intel
chips), then you need to do something such as

    $palavra_objeito->byteswap;


Now there's something we didn't try.


first (after you assign to $palavra_objeito and before you call ->utf16)
to convert from big-endian to little-endian.
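
To make the byte-order point concrete, here is a minimal sketch. It assumes a Perl that ships the core Encode module (5.8 or later) and uses that module instead of Unicode::String, so it illustrates the idea rather than the exact calls discussed in this thread. It encodes one character, U+00E9, in both byte orders and prints the raw bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, U+00E9 (e with acute accent), as a Perl character string.
my $char = "\x{e9}";

# Encode the same character in both byte orders.
my $be = encode('UTF-16BE', $char);   # big-endian: high byte first
my $le = encode('UTF-16LE', $char);   # little-endian: low byte first

# Show the raw bytes as hex so the difference is visible.
printf "UTF-16BE: %s\n", unpack('H*', $be);   # prints "UTF-16BE: 00e9"
printf "UTF-16LE: %s\n", unpack('H*', $le);   # prints "UTF-16LE: e900"
```

If the receiving side expects one order and gets the other, every character comes out as a different code point, which could explain a result that "wasn't what we expected".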

    $sql =  "INSERT INTO Tipo_Referencia ( Descricao ) SELECT '$palavra2' AS Expr1;";


Is there a reason why you don't write this as

    $sql = "INSERT INTO Tipo_Referencia ( Descricao ) " .
           "VALUES ('$palavra_em_utf16')"


Not really, but the previous syntax has worked many times.


? The "INSERT INTO table (columns) VALUES (literals)" is, for me, the
usual syntax, and "INSERT INTO table (columns) SELECT literals AS dummy"
looks strange to me.


Maybe; I just copied the syntax from an Access query. It worked on many
occasions. Even writing a UTF-8 value worked with that syntax. Obviously
Access didn't make much sense of it, as UTF-8 isn't really something it
"understands".

But your syntax is the more correct one (and the one that respects the SQL standard).

    print FICH3 $palavra2,"\n";
    $conn->execute($sql,,,adExecuteNoRecords);


This is the same as

    $conn->execute($sql,adExecuteNoRecords);

If the constant adExecuteNoRecords has to be the fourth parameter to
->execute, then say so:

    $conn->execute($sql, undef, undef, adExecuteNoRecords);

Perl isn't Visual Basic :)
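
A small sketch of why the empty commas do not reserve argument positions in Perl. The helper count_args is hypothetical and only reports how many arguments it received; the string values stand in for the real $sql and constant.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number of arguments it was called with.
sub count_args { return scalar @_ }

# Consecutive commas collapse, so only two arguments arrive.
my $skipped = count_args('sql',,,'flag');

# Explicit undef values keep the positions, so four arguments arrive.
my $explicit = count_args('sql', undef, undef, 'flag');

print "empty commas:   $skipped\n";    # prints "empty commas:   2"
print "explicit undef: $explicit\n";   # prints "explicit undef: 4"
```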


There, you caught me. I'm much more fluent in VB than in Perl, and I was the
one that gave my students the ADO code...

To summarise, I think you have misunderstood how Unicode::String works.
utf16() (called as a function, not a method) doesn't convert a string
*to* UTF-16, it expects a string in UTF-16 and converts *from* that
encoding into the internal format used by Unicode::String and returns an
object. Then you can call methods on that object to produce another
encoding such as UTF-8 or Latin-1 or whatever. So conversions involving
Unicode::String generally involve at least two calls.
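
The same two-call pattern (decode from the source encoding, then encode to the target) can be sketched with the core Encode module. This assumes Perl 5.8 or later and is not the Unicode::String API itself; the helper name transcode and the sample word are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert raw bytes from one encoding to another.
sub transcode {
    my ($from, $to, $bytes) = @_;
    my $chars = decode($from, $bytes);   # call 1: bytes -> characters
    return encode($to, $chars);          # call 2: characters -> bytes
}

# "café" as UTF-8 bytes: the e-acute takes two bytes, C3 A9.
my $utf8_bytes = "caf\xC3\xA9";

my $utf16le = transcode('UTF-8', 'UTF-16LE',   $utf8_bytes);
my $latin1  = transcode('UTF-8', 'ISO-8859-1', $utf8_bytes);

printf "UTF-16LE: %s\n", unpack('H*', $utf16le);   # prints "UTF-16LE: 630061006600e900"
printf "Latin-1:  %s\n", unpack('H*', $latin1);    # prints "Latin-1:  636166e9"
```

Naming both the source and the target encoding at each call makes the direction of the conversion hard to get backwards.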


Not quite, but it is clear that it was a bad example and your conclusions
are, therefore, justified.
I'll try your suggestions and let you know about the result.

Thank you for your time and your help.

Regards.

Rui