Re: UTF-16 -> UTF-8


 Philip,
Thank you - that's solved my problems.
Regards,
Tim
Philip Newton <Philip(_dot_)Newton(_at_)gmx(_dot_)net> wrote:
On Tue, 20 Nov 2001 16:49:38 +0000 (GMT), in perl.unicode you wrote:

binmode STDIN;
while(<>)
{
$u = utf16($_);
$u->byteswap2 if defined $swap; # $swap defined based on command line options


This looks strange. The way I read the manpage, byteswap2 is meant to be
called as a function, not as a Unicode::String object method. In other
words, its first parameter is supposed to be a string, not a
Unicode::String object (which will happen if you invoke it as a method
on an object). Did you mean either

$u = utf16($_);
$u->byteswap if defined $swap;

or

$_ = byteswap2($_) if defined $swap;
$u = utf16($_);

?
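
To make the function-style alternative concrete (a byte string goes in, a swapped byte string comes out), here is a minimal sketch in core Perl. The name swap16 is hypothetical, and this is a stand-in built on pack/unpack for illustration, not Unicode::String's own implementation of byteswap2.

```perl
use strict;
use warnings;

# Swap the two bytes of every 16-bit unit in a byte string.
# Returns a new string; an incomplete trailing byte is dropped by unpack.
sub swap16 {
    my ($bytes) = @_;
    return pack 'n*', unpack 'v*', $bytes;
}

my $le = "\x3c\x00\x3e\x00";    # "<>" as UTF-16LE bytes
my $be = swap16($le);           # the same text as UTF-16BE bytes

print join(' ', map { sprintf '%02X', $_ } unpack 'C*', $be), "\n";
# prints: 00 3C 00 3E
```

The point is only that the swap operates on the raw string before any Unicode::String object is built from it.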

print $u->utf8;
# some progress report code (one '.' every 1000 lines)
}
Having spotted the first line - could it be that I should avoid
while(<>) and use read() instead ?


That sounds good -- the U+000A (represented as '0A 00' in little-endian
order) got ripped apart by your line-oriented processing.

Actually, you can use <> as long as you change the value of $/ from its
default of "\n" to "\x0a\x00" so that it'll read the entire UTF-16
character in one go.
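
A small sketch of the difference, using an in-memory filehandle and hand-written UTF-16LE bytes for "a\nb\n". The helper names read_records and hex_bytes are made up for this example.

```perl
use strict;
use warnings;

# Read every record from a byte string using the given record separator.
sub read_records {
    my ($raw, $separator) = @_;
    open my $in, '<:raw', \$raw or die "cannot open in-memory handle: $!";
    local $/ = $separator;
    my @records = <$in>;
    close $in;
    return @records;
}

# Render a byte string as space-separated hex pairs.
sub hex_bytes {
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $_[0];
}

# "a\nb\n" as UTF-16LE bytes.
my $raw = "a\x00\x0a\x00b\x00\x0a\x00";

for my $sep ("\n", "\x0a\x00") {
    print join(' | ', map { hex_bytes($_) } read_records($raw, $sep)), "\n";
}
# prints:
# 61 00 0A | 00 62 00 0A | 00
# 61 00 0A 00 | 62 00 0A 00
```

With the default separator, every record after the first starts with the stray 00 byte, so its byte pairs are shifted by one. One caveat: the byte pair 0A 00 can in principle also occur across a character boundary (the high byte of a U+0Axx character followed by the low byte of a U+xx00 character), so this is not a fully general line splitter.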

And your file does indeed look as if the first line was (correctly)
interpreted as UTF-16LE (probably because of the BOM "FF FE" at the
beginning), but everything afterwards as UTF-16BE (the default
endianness for Unicode::String).

So "... 00 1F 00 17 53 AC 4E 1F 00 ..." was interpreted not (as you
wanted) as "[00xx] 001F 5317 4EAC 001F" but rather as "001F 0017 53AC
4E1F [00xx]". So instead of going (Big5) "... ¥_ ¨Ê ¤¤ °ê ²Ä ¤@ ¾ú ¥v ÀÉ
®× À] ... 1984 ... ·L ±² 1 ±² ..." / "... Beijing Zhongguo diyi lishi
tang'an guan..." (Beijing China first historical something-or-other?),
you get mojibake or character salad, including a hyphen '-' followed by
bu 'not', a bit later on a '1', "\x7f", a '(R)' registered trademark
sign, a lowercase 'r', and so on ("áF ¥á - ¤U ¦d ÐB ?? Ùä ËÚ ¾ø Ñï ì] 1
\x1f (R) ?? ?? r ®µ"). So your byteswapping went wonky, presumably due
to loss of synchronisation.
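
The two readings of that byte run can be reproduced with unpack. This is just a sketch: the unknown neighbouring byte (the "xx") is left out, and units_hex is a helper name invented for the example.

```perl
use strict;
use warnings;

# Format a list of 16-bit code units as four-digit hex strings.
sub units_hex {
    return join ' ', map { sprintf '%04X', $_ } @_;
}

# The byte run quoted above, without the unknown neighbouring bytes.
my $bytes = pack 'C*', 0x00, 0x1F, 0x00, 0x17, 0x53, 0xAC, 0x4E, 0x1F, 0x00;

# Intended reading: little-endian pairs starting at the second byte.
my $intended = units_hex(unpack 'v*', substr($bytes, 1));

# Actual reading: big-endian pairs starting at the first byte.
my $actual = units_hex(unpack 'n*', substr($bytes, 0, 8));

print "intended: $intended\n";    # intended: 001F 5317 4EAC 001F
print "actual:   $actual\n";      # actual:   001F 0017 53AC 4E1F
```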

So, I suggest setting $/ = "\x0a\x00" and then reading, and explicitly
byteswapping each line before converting it with utf16(). That's
assuming all your data is in little-endian UTF-16.
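
Put together, the loop might look like the sketch below. To keep it self-contained it uses pack/unpack for the swap and the core Encode module for the conversion, in place of byteswap2() and utf16() from Unicode::String; the function name utf16le_to_utf8 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Copy UTF-16LE text from $in to $out as UTF-8, one line at a time.
sub utf16le_to_utf8 {
    my ($in, $out) = @_;
    binmode $in;                  # raw bytes, no newline translation
    binmode $out;
    local $/ = "\x0a\x00";        # U+000A in little-endian byte order
    while (my $record = <$in>) {
        # Swap each byte pair: little-endian to big-endian.
        my $swapped = pack 'n*', unpack 'v*', $record;
        print {$out} encode('UTF-8', decode('UTF-16BE', $swapped));
    }
    return;
}

# Demonstration with in-memory handles; "hi\n" as UTF-16LE bytes.
my $input  = "h\x00i\x00\x0a\x00";
my $output = '';
open my $in,  '<', \$input  or die "cannot open input: $!";
open my $out, '>', \$output or die "cannot open output: $!";
utf16le_to_utf8($in, $out);
close $out;
print $output;                    # prints "hi" and a newline
```

For a real file, pass \*STDIN and \*STDOUT instead of the in-memory handles. Note that a leading BOM is not stripped here: once swapped it decodes as U+FEFF and is written out as an ordinary character.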

Cheers,
Philip

