perl-unicode

UTF-16BE -> UTF-8 encoding() error

2007-11-29 02:27:04
Hi, I found this list via the perldoc.perl.org/perluniintro page. If I'm
violating protocol by writing directly, please pardon me.

I have 2 data files I want to compare...one is in UTF-16BE (Windows
"Unicode" format) and one is in UTF-8 format.

I wrote 3 Perl programs:
1) one to normalize data in the UTF-16BE source and write to a UTF-8
formatted output file
2) one to normalize data in the UTF-8 source and write to a UTF-8 output
file
3) one to do a string comparison of the 2 output files and output 3 files:
"common items from both files", "items unique to UTF-16BE source", and
"items unique to UTF-8 source".
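
To make the third step concrete, here is a minimal sketch of the
comparison, assuming both inputs have already been read into arrays of
normalized strings. The helper name compare_items and the sample data are
made up for illustration; they are not taken from my actual programs.

```perl
use strict;
use warnings;

# Hypothetical helper: takes two array refs of already-normalized strings
# and returns three array refs: common items, items only in the first
# list, and items only in the second list (each sorted).
sub compare_items {
    my ($left, $right) = @_;
    my %in_left  = map { $_ => 1 } @$left;
    my %in_right = map { $_ => 1 } @$right;

    my @common     = sort grep {  $in_right{$_} } keys %in_left;
    my @only_left  = sort grep { !$in_right{$_} } keys %in_left;
    my @only_right = sort grep { !$in_left{$_}  } keys %in_right;

    return (\@common, \@only_left, \@only_right);
}

# Sample data standing in for the two normalized UTF-8 output files.
my ($common, $only_utf16, $only_utf8) =
    compare_items([ 'alpha', 'beta' ], [ 'beta', 'gamma' ]);

print "common: @$common\n";
print "unique to UTF-16BE source: @$only_utf16\n";
print "unique to UTF-8 source: @$only_utf8\n";
```

Hash lookups keep the comparison linear in the number of items, and
duplicates within one file collapse to a single entry.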

I noticed that the UTF-16BE -> UTF-8 conversion works fine, except for a
very few characters.  Specifically, the right single quotation mark
(U+2019):
http://www.fileformat.info/info/unicode/char/2019/index.htm

It is appearing in the source UTF-16BE file in a character stream such
as "...owe's...", where the ' is the character above, not the apostrophe
I have used to represent it.

The problem, it seems to me, is that when decode() sees it, it merges
the "'s" into some other bizarre characters, and I have to do this
replacement BEFORE decode() to avoid the problem:
 $char_inline =~ s/\x19\xE2\x81\xB3/\xE2\x80\x99\x73/;
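
For reference, the two-character sequence in question (U+2019 followed by
"s") has a fixed byte form in each encoding. The following is a minimal
sketch, using the core Encode module on a hand-built byte string rather
than my real data, of what a correct round trip should produce. The
helper name hex_bytes is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# U+2019 followed by "s", as raw UTF-16BE bytes (no BOM).
my $utf16be = "\x20\x19\x00\x73";

my $chars = decode('UTF-16BE', $utf16be);   # bytes -> 2 characters
my $utf8  = encode('UTF-8', $chars);        # characters -> 4 bytes

print hex_bytes($utf16be), "\n";   # 20 19 00 73
print hex_bytes($utf8),    "\n";   # E2 80 99 73
```

If the bytes handed to decode() differ from the first line of output,
the input was probably altered before decode() ever saw it.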

I've tried using the Unicode::Normalize routines, sometimes before and
sometimes after decode(), to test all possible orderings that might
yield the right result, to no avail.
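
As a point of comparison, here is a small sketch of the decode-then-
normalize order, assuming NFC is the form wanted (my programs may use a
different one). The input bytes are a made-up sample: "e" plus a
combining acute accent, then U+2019 and "s".

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC);

# "e" + COMBINING ACUTE ACCENT + RIGHT SINGLE QUOTATION MARK + "s",
# as raw UTF-16BE bytes.
my $bytes = "\x00\x65\x03\x01\x20\x19\x00\x73";

my $chars      = decode('UTF-16BE', $bytes);  # bytes -> characters first
my $normalized = NFC($chars);                 # then normalize characters

# NFC composes "e" + accent into U+00E9; U+2019 has no decomposition,
# so normalization leaves it alone.
printf "%d characters before NFC, %d after\n",
    length($chars), length($normalized);
```

Normalization works on characters, so applying it to the undecoded byte
string would not be meaningful.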

While this fails via decode(), if I use "iconv -f UTF-16 -t UTF-8" on
Solaris 9, the resultant output file is in UTF-8 format, and has the
correct Right-Quote character.
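
For comparison with the iconv command, a rough Perl equivalent using
PerlIO layers might look like the sketch below. It is not my actual
program: the helper name utf16be_to_utf8 and the temporary file names
are invented, and it assumes the input may start with a BOM. The :raw
layer is listed first so that no CRLF translation touches the UTF-16
bytes on Windows.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: convert a UTF-16BE file to UTF-8, line by line.
sub utf16be_to_utf8 {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:raw:encoding(UTF-16BE)', $in_path
        or die "$in_path: $!";
    open my $out, '>:raw:encoding(UTF-8)', $out_path
        or die "$out_path: $!";
    while (my $line = <$in>) {
        # UTF-16BE does not strip a BOM, so drop it from the first line.
        $line =~ s/^\x{FEFF}// if $. == 1;
        print {$out} $line;
    }
    close $out or die "$out_path: $!";
    close $in;
    return;
}

# Demo on a made-up input: BOM, then "owe" + U+2019 + "s" + newline.
my $dir = tempdir(CLEANUP => 1);
my ($src, $dst) = ("$dir/in.txt", "$dir/out.txt");

open my $fh, '>:raw', $src or die "$src: $!";
print {$fh} "\xFE\xFF\x00\x6F\x00\x77\x00\x65\x20\x19\x00\x73\x00\x0A";
close $fh or die "$src: $!";

utf16be_to_utf8($src, $dst);

open my $check, '<:raw', $dst or die "$dst: $!";
my $result = do { local $/; <$check> };
close $check;
print $result eq "owe\xE2\x80\x99s\n" ? "ok\n" : "mismatch\n";
```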

This makes me think that the decode() function, or Perl's internal
encoding conversion, is incomplete or in error for at least some
conversions between the various Unicode encodings.  Since it would
appear that the normalization routines only really have value **after**
the decode() conversion from some-random-encoding -> UTF-8, it would be
great if there were a way to ensure that the initial conversion was
always correct and complete.

With the exception of iconv, I ran Perl on Windows, so perhaps there is
a problem only with the Windows port?  Otherwise:
1) Please be aware of this error
2) Any suggestions (other than pre-translating via "iconv" ;-) )?

Thanks!
-NICK
