Re: UTF-16 -> UTF-8


Philip,
Here are the first 256 bytes of the file for which the conversion produced 
unexpected results, before and after conversion.
FF FE 03 00 01 00 0A 
00 18 00 6A 00 5A 00 01 00 00 00 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 4B 00 5A 00 48 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 43 00 01 00 00 00 00 00 00 00 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 4D 00 53 00 58 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 43 00 08 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 0D 00 20 00 39 00 31 00 30 00 30 00 30 00 30 00 31 00 1F 00 17 53 AC 4E 1F 
00 2D 4E 0B 57 2C 7B 00 4E 77 6B F2 53 94 6A 48 68 28 99 1F 00 31 00 39 00 38 
00 34 00 1F 00 AE 5F 72 63 31 00 72 63 1F 00 
^^^^^ before / after vvvvv
EF BB BF 03 01 0D 0A 
18 6A 5A 01 00 61 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 4B 5A 48 00 00 00 00 00 00 00 00 43 01 00 00 00 07 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 4D 53 58 00 00 00 00 00 00 00 00 43 08 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 0D 20 39 31 30 30 30 30 31 1F 17 E5 8E 
AC E4 B8 9F 2D E4 B8 8B E5 9C AC E7 AC 80 E4 B9 B7 E6 AF B2 E5 8E 94 E6 A9 88 
E6 A0 A8 E9 A4 9F 31 39 38 34 1F C2 AE E5 BD B2 E6 8C B1 72 E6 8C 9F 1F 31 36 
C3 90 E9 85 B3 E7 B0 9F 2E E7 BA AE E5 BC B6 E5 8B 81 E5 90 9F 43 48 49 1F 1F 
1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 1F 40 0D E4 BA 97 E5 BC 87 E8 A7 BD E8 A0 AE 
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 
20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 
and the salient parts of the code I used:
binmode STDIN;
while(<>)
{
  $u = utf16($_);
  $u->byteswap2 if defined $swap; # $swap defined based on command line options
  print $u->utf8;
# some progress report code (one '.' every 1000 lines)
}
Having spotted the first line of the dump (it breaks right after the 0A 
byte) - could it be that I should avoid while(<>) and use read() instead?
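For what it's worth, here is a minimal sketch of the read() approach, not yet run against the real files. The thinking: while(<>) splits at every 0A byte, and in little-endian UTF-16 that byte is only the first half of the 0A 00 unit, so every following "line" starts with the stray 00 and its byte pairs are shifted by one. The sketch assumes the input is UTF-16LE (as the FF FE mark suggests) and uses the Encode module in place of the Unicode::String calls above. The name utf16le_to_utf8 and the 8192-byte block size are my own illustrative choices. It also calls binmode on the output, since the 0D 0A in the "after" dump looks like a newline translation on STDOUT.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: copy UTF-16LE bytes from $in to $out as UTF-8,
# reading fixed-size blocks instead of "lines".
sub utf16le_to_utf8 {
    my ($in, $out, $chunk_size) = @_;
    $chunk_size ||= 8192;    # assumed block size, any positive value works

    binmode $in;             # raw bytes in
    binmode $out;            # raw bytes out, so LF is not turned into CR LF

    my $pending = '';        # bytes read but not yet converted
    while (1) {
        my $n = read($in, my $buf, $chunk_size);
        die "read failed: $!" unless defined $n;
        last if $n == 0;     # end of input
        $pending .= $buf;

        # Convert only whole 16-bit units; keep an odd trailing byte for later.
        my $usable = length($pending) - (length($pending) % 2);

        # If the block ends on a high surrogate, hold it back so the
        # pair is decoded together with its low surrogate.
        if ($usable >= 2) {
            my $last = unpack 'v', substr($pending, $usable - 2, 2);
            $usable -= 2 if $last >= 0xD800 && $last <= 0xDBFF;
        }
        next unless $usable;

        my $bytes = substr($pending, 0, $usable, '');   # removes them from $pending
        print {$out} encode('UTF-8', decode('UTF-16LE', $bytes));
    }

    warn sprintf("%d trailing byte(s) ignored\n", length $pending)
        if length $pending;
    return;
}
```

Called as utf16le_to_utf8(\*STDIN, \*STDOUT), this would stand in for the while(<>) loop. The leading FF FE mark is decoded as U+FEFF and passed through, so the output still starts with EF BB BF as in the dump above.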
Thanks,
Tim
  Philip Newton <Philip(_dot_)Newton(_at_)gmx(_dot_)net> wrote: On Tue, 20 Nov 
2001 15:59:07 +0000 (GMT), in perl.unicode you wrote:

b. One file worked fine, but for another it converted the Chinese
data to different Chinese data.


Did you see any correlation between the code points? Like, say, turning
4567 into 6745?

Can you give an example of "before" and "after" data?

PS: Does anyone know of - even an odd looking - Fixed pitch Unicode font
including Western European, CJK, Cyrillic and Greek glyphs (ie: most Left
to Right data) ? It's not for an end-user, it's for techies like myself,
so it doesn't need to be brilliant, just more distinctive than a set of
squares or blocks !


I think MS Mincho (that came with Japanese language pack for MSIE 3.0, I
think) is fixed-width and has Western, Cyrillic, and Greek glyphs --
and, of course, a large assortment of CJK. But I've only used it for CJK
so I can't say for sure.

Cheers,
Philip

