perl-unicode

Re: UTF-16 -> UTF-8

2001-11-20 13:19:46
On Tue, 20 Nov 2001 16:49:38 +0000 (GMT), in perl.unicode you wrote:

binmode STDIN;
while(<>)
{
  $u = utf16($_);
  $u->byteswap2 if defined $swap; # $swap defined based on command line options

This looks strange. The way I read the manpage, byteswap2 is meant to be
called as a function, not as a Unicode::String object method. In other
words, its first parameter is supposed to be a string, not a
Unicode::String object (which will happen if you invoke it as a method
on an object). Did you mean either

    $u = utf16($_);
    $u->byteswap if defined $swap;

or

    $_ = byteswap2($_) if defined $swap;
    $u = utf16($_);

?

  print $u->utf8;
# some progress report code (one '.' every 1000 lines)
}
Having spotted the first line - could it be that I should avoid
while(<>) and use read() instead ?

That sounds good -- the U+000A (represented as '0A 00' in little-endian
order) got ripped apart by your line-oriented processing.

Actually, you can use <> as long as you change the value of $/ from its
default of "\n" to "\x0a\x00" so that it'll read the entire UTF-16
character in one go.
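
As a rough sketch of what that looks like (core Perl only; the sample string and the in-memory filehandle are my own stand-ins, not your data), each record then ends on a whole two-byte line feed. One caveat I am fairly sure of: the byte pair 0A 00 can also straddle two characters (e.g. U+0A05 followed by U+4E00 is "05 0A 00 4E" in little-endian order), so this is not watertight for arbitrary text.

```perl
use strict;
use warnings;

# Hypothetical sample: "ab\ncd\n" encoded as UTF-16LE, no BOM.
my $bytes = pack('v*', map { ord } split //, "ab\ncd\n");

# In-memory filehandle, so the sketch needs no file on disk.
open(my $fh, '<', \$bytes) or die "cannot open in-memory handle: $!";
binmode $fh;

my @records;
{
    local $/ = "\x0a\x00";    # U+000A as it appears in UTF-16LE
    while (my $rec = <$fh>) {
        push @records, $rec;
    }
}
close $fh;

# Every record should have an even byte count, i.e. no split code unit.
printf "%d bytes\n", length $_ for @records;
```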

And your file does indeed look as if the first line was (correctly)
interpreted as UTF-16LE (probably because of the BOM "FF FE" at the
beginning), but everything afterwards as UTF-16BE (the default
endianness for Unicode::String).

So "... 00 1F 00 17 53 AC 4E 1F 00 ..." was interpreted not (as you
wanted) as "[00xx] 001F 5317 4EAC 001F" but rather as "001F 0017 53AC
4E1F [00xx]". So instead of going (Big5) "... 北 京 中 國 第 一 歷 史 檔
案 館 ... 1984 ... 微 捲 1 捲 ..." / "... Beijing Zhongguo diyi lishi
tang'an guan..." (Beijing China first historical something-or-other?),
you get mojibake or character salad, including a hyphen '-' followed by
bu 'not', a bit later on a '1', "\x1f", a '(R)' registered trademark
sign, a lowercase 'r', and so on ("厬 丟 - 下 圬 笀 ?? 毲 厔 橈 栨 餟 1
\x1f (R) ?? ?? r 挾"). So your byteswapping went wonky, presumably due
to loss of synchronisation.
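
If you want to check the two readings yourself, something like the following should reproduce them from the bytes quoted above. It uses only pack/unpack; the variable names are mine, and the unknown neighbouring bytes (the "xx") are simply left out.

```perl
use strict;
use warnings;

# The nine bytes from the hex dump quoted above.
my $dump = pack('H*', '001F001753AC4E1F00');

# Intended reading: little-endian 16-bit units. The first byte of the dump
# is the high byte of the preceding unit, so skip it.
my @le = map { sprintf '%04X', $_ } unpack('v*', substr($dump, 1));

# What apparently happened: big-endian units starting on that first byte,
# i.e. one byte out of step. Only the first eight bytes form whole units.
my @be = map { sprintf '%04X', $_ } unpack('n*', substr($dump, 0, 8));

print "LE, aligned: @le\n";
print "BE, shifted: @be\n";
```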

So, I suggest setting $/ = "\x0a\x00" and then reading, and explicitly
byteswapping each line before converting it with utf16(). That's
assuming all your data is in little-endian UTF-16.
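
Put together, a minimal sketch of that loop might look like this. Please treat it as an illustration under assumptions rather than tested advice for your setup: I have used pack/unpack as a stand-in for byteswap2() and the Encode module as a stand-in for utf16()->utf8, so that it runs without Unicode::String; the subroutine name, the BOM stripping, and the sample data are my own additions.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert UTF-16LE input to UTF-8, one line-terminated record at a time.
# (Hypothetical helper; assumes the whole input is little-endian.)
sub utf16le_to_utf8 {
    my ($in, $out) = @_;
    local $/ = "\x0a\x00";                        # U+000A in UTF-16LE
    my $first = 1;
    while (my $rec = <$in>) {
        $rec =~ s/\A\xff\xfe// if $first;         # drop a leading BOM, if any
        $first = 0;
        my $be = pack('n*', unpack('v*', $rec));  # swap each byte pair
        print {$out} encode('UTF-8', decode('UTF-16BE', $be));
    }
    return;
}

# Usage with in-memory handles and made-up sample data:
# BOM, U+5317, U+4EAC, line feed, '1', line feed.
my $src = "\xff\xfe" . pack('v*', 0x5317, 0x4EAC, 0x0A, 0x31, 0x0A);
open(my $in,  '<', \$src)     or die "open in: $!";
open(my $out, '>', \my $utf8) or die "open out: $!";
binmode $in;
utf16le_to_utf8($in, $out);
close $out;
```

With real files you would pass STDIN (after binmode) and STDOUT instead of the in-memory handles.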

Cheers,
Philip
