Re: UTF-16BE -> UTF-8 encoding() error

On 2007-11-29 01:04, Jenkins, Nicholas S (GE Money) wrote:


I have 2 data files I want to compare...one is in UTF-16BE (Windows
"Unicode" format) and one is in UTF-8 format.

I wrote 3 perl programs:
*) 1 to normalize data in the UTF-16BE source and write to a UTF-8
formatted output file
*) 1 to normalize data in the UTF-8 source and write to a UTF-8 output
file
*) 1 to do a string comparison of the 2 output files and output 3 files:
"common items from both files", "items unique to UTF-16BE source", and
"items unique to UTF-8 source".

I noticed that the UTF16BE->UTF-8 conversion works fine, except for a
very few characters.  Specifically, the Right-Quote:
http://www.fileformat.info/info/unicode/char/2019/index.htm

It is appearing in the source UTF-16BE file in a character stream such
as "...owe's...", where the ' is the character above, not the apostrophe
I have used to represent it.

The problem, it seems to me, is that when the decode function sees it,
it merges the "'s" into some other bizarre characters, and I have to do
this replacement BEFORE decode() to avoid the problem:
 $char_inline =~ s/\x19\xE2\x81\xB3/\xE2\x80\x99\x73/;

You explain it as if the string "\x19\xE2\x81\xB3" is found in the
UTF16BE file. Apparently, this is not the case, because you say to
replace that string with some UTF-8 string before decode (e28099 = UTF-8
sequence for U+2019, 73 = "s").  After the replacement the decode()
succeeds, you say.
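
To make the byte values concrete, here is a minimal sketch of what a
clean decode of that character should produce. It assumes the standard
Encode module; the variable names and the four sample bytes are made up
for illustration and are not taken from the original program.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed sample: U+2019 followed by "s", as raw UTF-16BE bytes.
my $utf16be_bytes = "\x20\x19\x00\x73";

# decode() turns the raw bytes into Perl characters ...
my $chars = decode('UTF-16BE', $utf16be_bytes);

# ... and encode() turns those characters into UTF-8 bytes.
my $utf8_bytes = encode('UTF-8', $chars);

# Show the resulting UTF-8 bytes in hex.
print join(' ', map { sprintf '%02X', ord($_) } split //, $utf8_bytes), "\n";
# prints: E2 80 99 73
```

Note that the UTF-16BE input contains neither \x19\xE2\x81\xB3 nor any
UTF-8 sequence; the E2 80 99 bytes only exist after the conversion.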

So how did you get that strange string into the program at that point?
What manipulations did you do before?  Because you seem to handle it
as UTF-8 encoded already at that point. So your program has already
converted the UTF16BE to something else.


So explain how you got that string first.
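
For comparison, a conversion that never touches the raw bytes by hand
could look roughly like the sketch below. It is only an illustration
under assumptions: the temporary files and the sample text "owe’s" are
hypothetical stand-ins for the real data, and the I/O layers do all of
the decoding and encoding. The ":raw" in front of each encoding layer
is there so that no CRLF translation gets in the way on Windows.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical input file: "owe", U+2019, "s", written as UTF-16BE.
my ($in_fh, $in_name) = tempfile();
binmode($in_fh, ':raw:encoding(UTF-16BE)');
print {$in_fh} "owe\x{2019}s\n";
close $in_fh;

# Read with a UTF-16BE layer, write with a UTF-8 layer.
open(my $src, '<:raw:encoding(UTF-16BE)', $in_name)
    or die "Cannot open $in_name: $!";
my ($out_fh, $out_name) = tempfile();
binmode($out_fh, ':raw:encoding(UTF-8)');
while (my $line = <$src>) {
    # $line holds decoded characters here; normalization would go here.
    print {$out_fh} $line;
}
close $src;
close $out_fh;

# Read the output back as raw bytes to inspect what was written.
open(my $raw, '<:raw', $out_name) or die "Cannot open $out_name: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

print join(' ', map { sprintf '%02X', ord($_) } split //, $bytes), "\n";
# prints: 6F 77 65 E2 80 99 73 0A
```

If your program produces something different for the same input, the
difference has to come from a step before or around the decode.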


I've tried using the Unicode::Normalization routines, sometimes before
and sometimes after decode(), to test all possible states that might
yield the right result, to no avail.

While this fails via decode(), if I use "iconv -f UTF-16 -t UTF-8" on
Solaris 9, the resultant output file is in UTF-8 format, and has the
correct Right-Quote character.

This makes me think that the decode function, or Perl-internal code page
conversion function is incomplete/in error for at least a portion of the
available code pages between various Unicode code-pages.  Since it would
appear that the Normalization routines only really have value **after**
the decode() conversion from some-random-code-page -> UTF-8, it would be
great if there were a way to ensure that the initial conversion was
always correct and complete.

With the exception of iconv, I ran perl on Windows, so, perhaps there is
a problem only with the Windows port?  Otherwise:
1) Please be aware of this error
2) Any suggestions (other than pre-translating via "iconv" ;-)




--
Paul Bijnens, xplanation Technology Services        Tel  +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM    Fax  +32 16 397.512
http://www.xplanation.com/    email:  Paul(_dot_)Bijnens(_at_)xplanation(_dot_)com
***********************************************************************
* I think I've got the hang of it now:  exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt,  abort,  hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e,  kill -1 $$,  shutdown, *
* init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ...  "Are you sure?"  ...   YES   ...   Phew ...   I'm out          *
***********************************************************************