Hi...I found this DL via the perldoc.perl.org/perluniintro page. If I'm
violating protocol by writing directly, please pardon me.
I have two data files I want to compare: one is in UTF-16BE (Windows
"Unicode" format) and the other is in UTF-8 format.
I wrote three Perl programs:
*) one to normalize the data in the UTF-16BE source and write it to a
UTF-8 output file
*) one to normalize the data in the UTF-8 source and write it to a UTF-8
output file
*) one to do a string comparison of the two output files and write three
files: "common items from both files", "items unique to UTF-16BE source",
and "items unique to UTF-8 source".
I noticed that the UTF-16BE -> UTF-8 conversion works fine, except for a
very few characters. Specifically, the right single quotation mark (U+2019):
http://www.fileformat.info/info/unicode/char/2019/index.htm
It appears in the UTF-16BE source file in character sequences such as
"...owe's...", where the ' stands for the character above, not the ASCII
apostrophe I have typed here to represent it.
The problem, it seems to me, is that when decode() sees this character, it
merges the "'s" into some other bizarre characters, and I have to do
this replacement BEFORE decode() to avoid the problem:
$char_inline =~ s/\x19\xE2\x81\xB3/\xE2\x80\x99\x73/;
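To make the byte values in the substitution above easier to read, here is a minimal sketch that prints the code points they correspond to. It assumes only the standard Encode module; the helper name code_points is hypothetical and is not taken from the actual programs.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a decoded string as a list of U+XXXX code points.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $string;
}

# The four bytes that the right quote followed by "s" occupies in UTF-16BE.
my $chars = decode('UTF-16BE', "\x20\x19\x00\x73");
print code_points($chars), "\n";

# The same two characters re-encoded as UTF-8, shown as hex bytes.
my $utf8 = encode('UTF-8', $chars);
print join(' ', map { sprintf('%02X', ord($_)) } split //, $utf8), "\n";

# The two UTF-8 byte sequences from the substitution, decoded for comparison.
print code_points(decode('UTF-8', "\xE2\x80\x99\x73")), "\n";  # replacement side
print code_points(decode('UTF-8', "\x19\xE2\x81\xB3")), "\n";  # pattern side
```

The replacement side decodes to U+2019 U+0073, while the pattern side decodes to U+0019 U+2073. Both pairs are built from the byte values 19, 20 and 73, so the difference may come down to how the input bytes are being grouped into 16-bit units, though that is only a guess.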
I've tried using the Unicode::Normalize routines, sometimes before
and sometimes after decode(), to test all possible states that might
yield the right result, to no avail.
While this fails via decode(), if I use "iconv -f UTF-16 -t UTF-8" on
Solaris 9, the resultant output file is in UTF-8 format, and has the
correct Right-Quote character.
This makes me think that decode(), or Perl's internal encoding conversion,
is incomplete or in error for at least some characters when converting
between Unicode encodings. Since it would appear that the normalization
routines only really have value **after** the decode() conversion from
some arbitrary encoding to UTF-8, it would be great if there were a way to
ensure that the initial conversion was always correct and complete.
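One approach that may be worth testing is to let PerlIO encoding layers do the conversion while the file is read, rather than calling decode() on raw lines. The following is only a sketch under assumptions: the helper name utf16be_to_utf8 and the temporary sample file are hypothetical, and it assumes the input really is big-endian UTF-16 without a byte order mark.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: copy a UTF-16BE text file to UTF-8 using PerlIO layers.
# ':raw' comes first so that no CRLF translation touches the 16-bit data.
sub utf16be_to_utf8 {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:raw:encoding(UTF-16BE)', $in_path  or die "$in_path: $!";
    open my $out, '>:raw:encoding(UTF-8)',    $out_path or die "$out_path: $!";
    while (my $line = <$in>) {   # lines are split after decoding, not on raw bytes
        print {$out} $line;
    }
    close $in;
    close $out or die "$out_path: $!";
    return;
}

# Sample input: "owe", the right quote (U+2019), "s" and a newline in UTF-16BE.
my ($src_fh, $src_path) = tempfile(UNLINK => 1);
binmode $src_fh;
print {$src_fh} "\x00\x6F\x00\x77\x00\x65\x20\x19\x00\x73\x00\x0A";
close $src_fh or die "$src_path: $!";

my (undef, $dst_path) = tempfile(UNLINK => 1);
utf16be_to_utf8($src_path, $dst_path);
```

With the layer in place, line splitting happens on decoded characters, so a 0x0A byte is never separated from the other byte of its 16-bit unit. Whether that matters here depends on how the original programs read their input.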
Apart from iconv, I ran everything with Perl on Windows, so perhaps the
problem exists only in the Windows port? Otherwise:
1) Please be aware of this error.
2) Do you have any suggestions (other than pre-translating via "iconv" ;-) ?
Thanks!
-NICK