Problems with Perl Asian encodings?
I'm having a problem that perhaps someone here can cast some light on.
For a very long time, a project I work on has been using GNU recode 3.6
to transcode a wide range of encodings into UTF-8, including some of the
more common Korean, Japanese and Chinese encodings (e.g., SJIS, gb2312,
EUC-KR). For efficiency reasons, we've been looking at moving to the
Perl Encode module, which we already use for windows-1252 because
GNU recode was corrupting Arabic text.
I compared a batch of about 120,000 documents for which I had both the
original and the output of GNU recode, and discovered a (relatively
small) number of differences, in roughly 4,000 documents. In almost all
of those cases, the Perl Encode output appears to be inferior. The vast
majority of the differing documents are gb2312; however, many of the
other Asian encodings have sporadic problems. When I examine the
documents for differences, I typically find that Perl Encode has
introduced some stray "unknown" characters at various points in the
document, while the GNU recode version is clean.
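To make "stray unknown characters" concrete, here is a minimal sketch of how they could be counted. It assumes the "unknown" characters are U+FFFD, the replacement character that Encode's decode() substitutes by default for bytes it cannot map. The helper name count_unmapped and the sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Count the U+FFFD replacement characters that decode() produces for
# bytes it cannot map in the given encoding. (Hypothetical helper.)
sub count_unmapped {
    my ($bytes, $encoding) = @_;
    # With no CHECK argument, decode() substitutes U+FFFD instead of dying.
    my $text  = decode($encoding, $bytes);
    my $count = () = $text =~ /\x{FFFD}/g;
    return $count;
}

# Illustrative input: two GB2312 characters followed by one stray byte.
my $sample = "\xC4\xE3\xBA\xC3\xFF";
print "replacement characters: ", count_unmapped($sample, 'euc-cn'), "\n";
```

A document that already contains U+FFFD in its source would be counted as well, so this is only a rough indicator.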
Has anyone else done such a comparison of GNU recode and Perl Encode?
I'd very much prefer to move to Perl, not simply for efficiency but
because, unlike GNU recode, it appears to be actively maintained;
however, the error rate is just too high, especially considering that
the GNU recode output looks clean, and our users have not complained.
Any comments or advice would be welcome. I'm using Perl 5.8.7 (I know,
it's not the latest version, but it's part of a very stable
configuration that the project doesn't want to vary).
Thanks in advance -
The MITRE Corporation
P.S. My familiarity with encoding issues is not extensive, and one thing
that occurred to me was that there may be an encoding name conflict
between GNU recode and Perl Encode that was leading to the differences
I was seeing. However, in the first two cases I examined, no encoding
known to Perl Encode for the given languages (Chinese and Japanese)
yielded the same (clean) output as GNU recode.
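
The check described in this P.S. can be sketched roughly as follows. It decodes the original bytes with each candidate encoding and tests whether the UTF-8 result is byte-identical to the GNU recode output. The helper name matching_encodings, the sample data and the name filter are assumptions made for illustration; only decode(), encode() and Encode->encodings(":all") come from the Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return the candidate encoding names whose decoding of $bytes,
# re-encoded as UTF-8, equals $reference_utf8. (Hypothetical helper.)
sub matching_encodings {
    my ($bytes, $reference_utf8, @candidates) = @_;
    my @matches;
    for my $name (@candidates) {
        # decode() dies on an unknown encoding name, so trap that.
        my $text = eval { decode($name, $bytes) };
        next unless defined $text;
        push @matches, $name if encode('utf8', $text) eq $reference_utf8;
    }
    return @matches;
}

# Illustrative data: two GB2312 characters and their UTF-8 form, standing
# in for an original document and its GNU recode output.
my $original  = "\xC4\xE3\xBA\xC3";
my $reference = "\xE4\xBD\xA0\xE5\xA5\xBD";

# Try every installed encoding whose name looks Chinese-related.
my @candidates = grep { /gb|cp936|euc-cn|hz|big5/i } Encode->encodings(':all');
print "$_\n" for matching_encodings($original, $reference, @candidates);
```

An empty result for a real document would correspond to what I saw: no Encode encoding reproducing the GNU recode output.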