perl-i18n

Problems with Perl Asian encodings?

2007-05-10 12:31:38
All -

I'm having a problem that perhaps someone here can cast some light on. For a very long time, a project I work on has been using GNU recode 3.6 to transcode a wide range of encodings into UTF-8, including some of the more common Korean, Japanese and Chinese encodings (e.g., SJIS, gb2312, EUC-KR). For efficiency reasons, we've been looking at moving to the Perl Encode module, which we already use for windows-1252 because of corruption issues with Arabic with GNU recode.

I compared a batch of about 120,000 documents for which I had both the original and the output of GNU recode, and discovered a relatively small number of differences: about 4,000 documents. In almost all of those cases, the Perl Encode output appears to be inferior. The vast majority of the differing documents are gb2312; however, many of the other Asian encodings have sporadic problems. When I examine the documents for differences, I typically find that Perl Encode has introduced some stray "unknown" characters at various points in the document, while the GNU recode version is clean.

Has anyone else done such a comparison of GNU recode and Perl Encode? I'd very much prefer to move to Perl, not simply for efficiency but because, unlike GNU recode, it appears to be actively maintained; however, the error rate is just too high, especially considering that the GNU recode output looks clean, and our users have not complained about it.

Any comments or advice would be welcome. I'm using Perl 5.8.7 (I know, it's not the latest version, but it's part of a very stable configuration that the project doesn't want to vary).

Thanks in advance -
Sam Bayer
The MITRE Corporation
sam(_at_)mitre(_dot_)org

P.S. My familiarity with encoding issues is not extensive, and one thing that occurred to me was that there may be an encoding name conflict between GNU recode and Perl Encode which was leading to the differences I was seeing. However, in the first two cases I examined, no encoding known to Perl Encode for the given languages (Chinese and Japanese) yielded the same (clean) output as GNU recode.
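
For reference, a check of this kind can be sketched roughly as follows. This is only an illustration, not the exact script used for the comparison: the helper name count_replacements and the sample bytes are made up for the example, and it assumes that Encode's default behaviour of substituting U+FFFD for undecodable input is what produces the "unknown" characters described above. The names euc-cn and cp936 are the ones Encode uses for GB2312 and GBK respectively.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode the same bytes under each candidate
# encoding name and count the U+FFFD replacement characters produced.
sub count_replacements {
    my ($bytes, @candidates) = @_;
    my %count;
    for my $name (@candidates) {
        # Default CHECK: malformed input is replaced by U+FFFD
        # rather than raising an error.
        my $text = decode($name, $bytes);
        $count{$name} = () = $text =~ /\x{FFFD}/g;
    }
    return \%count;
}

# Made-up sample: the two-byte sequence 0x81 0x40 is valid in GBK
# (cp936) but falls outside the GB2312 (euc-cn) code range.
my $sample = "\x81\x40abc";
my $result = count_replacements($sample, qw(euc-cn cp936));

printf "%-8s %d replacement character(s)\n", $_, $result->{$_}
    for sort keys %$result;
```

Running the same loop over a real document with every candidate name for its language (for example shiftjis, cp932 and euc-jp for Japanese) shows at a glance whether any of them decodes the input without replacement characters.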
