On Tue, 20 Nov 2001 16:49:38 +0000 (GMT), in perl.unicode you wrote:
binmode STDIN;
while(<>)
{
$u = utf16($_);
$u->byteswap2 if defined $swap; # $swap defined based on command line options
This looks strange. The way I read the manpage, byteswap2 is meant to be
called as a function, not as a Unicode::String object method. In other
words, its first parameter is supposed to be a string, not a
Unicode::String object (which will happen if you invoke it as a method
on an object). Did you mean either
$u = utf16($_);
$u->byteswap if defined $swap;
or
$_ = byteswap2($_) if defined $swap;
$u = utf16($_);
?
print $u->utf8;
# some progress report code (one '.' every 1000 lines)
}
Having spotted the first line - could it be that I should avoid
while(<>) and use read() instead ?
That sounds good -- the U+000A (represented as '0A 00' in little-endian
order) got ripped apart by your line-oriented processing.
Actually, you can use <> as long as you change the value of $/ from its
default of "\n" to "\x0a\x00" so that it'll read the entire UTF-16
character in one go.
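To make the difference concrete, here is a minimal sketch that reads the same bytes twice, once with the default $/ and once with "\x0a\x00". The sample data and the read_records() helper are made up for illustration, and the in-memory filehandle assumes perl 5.8 or later.

```perl
use strict;
use warnings;

# Hypothetical sample: the UTF-16LE lines "AB\n" and "C\n", built by hand
# (41 00 42 00 0A 00 43 00 0A 00).
my $data = "A\x00B\x00\x0a\x00C\x00\x0a\x00";

# Read every record from an in-memory filehandle using the given separator.
sub read_records {
    my ($sep) = @_;
    local $/ = $sep;
    open my $fh, '<', \$data or die "cannot open in-memory handle: $!";
    binmode $fh;
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @default = read_records("\n");        # splits right after the 0A byte
my @utf16le = read_records("\x0a\x00");  # splits after the whole U+000A

printf "default \$/:  %s\n", join " | ", map { unpack "H*", $_ } @default;
printf "UTF-16LE \$/: %s\n", join " | ", map { unpack "H*", $_ } @utf16le;
```

With the default separator, every record after the first starts with the stray 00 byte left over from the previous newline, so all later 16-bit units are shifted by one byte. One caveat: "\x0a\x00" is matched on bytes, so it can in principle also match across a character boundary (a character whose high byte is 0A followed by one whose low byte is 00); for data like yours that should be rare.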
And your file does indeed look as if the first line was (correctly)
interpreted as UTF-16LE (probably because of the BOM "FF FE" at the
beginning), but everything afterwards as UTF-16BE (the default
endianness for Unicode::String).
So "... 00 1F 00 17 53 AC 4E 1F 00 ..." was interpreted not (as you
wanted) as "[00xx] 001F 5317 4EAC 001F" but rather as "001F 0017 53AC
4E1F [00xx]". So instead of getting (Big5) "... 北 京 中 國 第 一 歷 史 檔
案 館 ... 1984 ... 微 捲 1 捲 ..." / "... Beijing Zhongguo diyi lishi
dang'an guan..." (roughly, the First Historical Archives of China in
Beijing), you get mojibake or character salad, including a hyphen '-'
followed by bu 'not', a bit later on a '1', "\x1f", a '(R)' registered
trademark sign, a lowercase 'r', and so on ("厬 丟 - 下 圬 笀 ?? 毲 厔 橈
栨 餟 1 \x1f (R) ?? ?? r 挾"). So your byteswapping went wonky,
presumably because the reader was one byte out of step with the 16-bit
boundaries.
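You can check the two readings of those nine bytes with plain unpack. This is only a sketch of the arithmetic, not what Unicode::String does internally; the variable names are mine.

```perl
use strict;
use warnings;

# The nine bytes quoted above: 00 1F 00 17 53 AC 4E 1F 00
my $bytes = pack "H*", "001f001753ac4e1f00";

# Intended reading: the leading 00 belongs to the previous character,
# so skip it and read little-endian 16-bit units ("v").
my @le = unpack "v*", substr($bytes, 1);

# What actually happened: big-endian units ("n") starting at the first
# byte, leaving the trailing 00 for the next character.
my @be = unpack "n*", substr($bytes, 0, 8);

my $le_hex = join " ", map { sprintf "%04X", $_ } @le;
my $be_hex = join " ", map { sprintf "%04X", $_ } @be;

print "little-endian, aligned:  $le_hex\n";   # 001F 5317 4EAC 001F
print "big-endian, off by one:  $be_hex\n";   # 001F 0017 53AC 4E1F
```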
So, I suggest setting $/ = "\x0a\x00" and then reading, and explicitly
byteswapping each line before converting it with utf16(). That's
assuming all your data is in little-endian UTF-16.
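Put together, the loop might look roughly like the sketch below. To keep it runnable without Unicode::String, I have swapped in core tools: pack/unpack does the byteswap and the Encode module (perl 5.8 or later) stands in for utf16(...)->utf8. The helper name and the sample input are made up, and BOM handling is left out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert one little-endian UTF-16 record to UTF-8 bytes.
sub utf16le_record_to_utf8 {
    my ($record) = @_;
    # Byteswap: read 16-bit units as little-endian ("v") and write them
    # back as big-endian ("n").
    my $swapped = pack "n*", unpack "v*", $record;
    # Stand-in for utf16($swapped)->utf8: decode as big-endian UTF-16,
    # then encode as UTF-8.
    return encode("UTF-8", decode("UTF-16BE", $swapped));
}

# Hypothetical sample input: "北京\n" in UTF-16LE (17 53 AC 4E 0A 00).
my $input = "\x17\x53\xac\x4e\x0a\x00";
open my $in, '<', \$input or die "cannot open in-memory handle: $!";
binmode $in;

my $output = '';
{
    local $/ = "\x0a\x00";    # one complete U+000A ends each record
    while (my $record = <$in>) {
        $output .= utf16le_record_to_utf8($record);
    }
}
close $in;

print $output;
```

In your script you would read from STDIN instead of the in-memory handle and print each converted record as you go.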
Cheers,
Philip