Problems with Perl Asian encodings?

All -

I'm having a problem that perhaps someone here can cast some light on. For a very long time, a project I work on has been using GNU recode 3.6 to transcode a wide range of encodings into UTF-8, including some of the more common Korean, Japanese and Chinese encodings (e.g., SJIS, gb2312, EUC-KR). For efficiency reasons, we've been looking at moving to the Perl Encode module, which we already use for windows-1252 because of corruption issues with Arabic with GNU recode.

I compared a batch of about 120,000 documents for which I had both the original and the output of GNU recode, and discovered a (relatively small) number of differences, about 4,000 documents. In almost all of those cases, the Perl Encode output appears to be inferior. The vast majority of the differing documents are gb2312; however, many of the other Asian encodings have sporadic problems. When I examine the documents for differences, I typically find that Perl Encode has introduced some stray "unknown" characters at various points in the document, while the GNU recode version is clean.

Has anyone else done such a comparison of GNU recode and Perl Encode? I'd very much prefer to move to Perl, not simply for efficiency but because, unlike GNU recode, it appears to be actively maintained; however, the error rate is just too high, especially considering that the GNU recode output looks clean, and our users have not complained about it.

Any comments or advice would be welcome. I'm using Perl 5.8.7 (I know, it's not the latest version, but it's part of a very stable configuration that the project doesn't want to vary).


Thanks in advance -
Sam Bayer
The MITRE Corporation
sam(_at_)mitre(_dot_)org

P.S. My familiarity with encoding issues is not extensive, and one thing that occurred to me was that there may be an encoding name conflict between GNU recode and Perl Encode which was leading to the differences I was seeing. However, in the first two cases I examined, no encoding known to Perl Encode for the given languages (Chinese and Japanese) yielded the same (clean) output as GNU recode.
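
The check described in the P.S. (trying each candidate encoding name against the same input) could be sketched roughly as below. This is only an illustration, not the script used for the comparison above: the helper name check_decode, the candidate list, and the sample bytes are all hypothetical. It relies on two documented behaviours of Encode: with no CHECK argument, decode() replaces malformed input with U+FFFD, and with Encode::FB_CROAK it dies at the first malformed sequence. The premise that some nominally gb2312 documents contain bytes outside strict EUC-CN is an assumption to be tested, not a finding.

```perl
use strict;
use warnings;
use Encode qw(decode resolve_alias);

# Hypothetical helper: decode $octets as $encoding and report
#   (number of U+FFFD replacement characters, strict-decode error or '').
sub check_decode {
    my ($encoding, $octets) = @_;

    # Default mode: malformed sequences become U+FFFD.
    my $lenient  = decode($encoding, $octets);
    my $replaced = () = $lenient =~ /\x{FFFD}/g;

    # Strict mode: FB_CROAK dies on the first malformed sequence.
    # decode() may modify its input when CHECK is set, so pass a copy.
    my $copy  = $octets;
    my $ok    = eval { decode($encoding, $copy, Encode::FB_CROAK()); 1 };
    my $error = $ok ? '' : $@;

    return ($replaced, $error);
}

# Sample bytes (an assumption for illustration): two bytes that are valid
# in EUC-CN followed by two bytes that are only valid in the larger
# CP936/GBK repertoire.
my $sample = "\xC4\xE3\x81\x40";

# Candidate names to try; the canonical name each one maps to is printed
# so that alias differences between tools become visible.
for my $name (qw(gb2312 euc-cn cp936)) {
    my $canonical = resolve_alias($name) || 'unknown';
    my ($replaced, $error) = check_decode($name, $sample);
    printf "%-8s (canonical: %s): %d replacement character(s)%s\n",
        $name, $canonical, $replaced, $error ? ", strict decode failed" : "";
}
```

Running something like this over a few of the differing documents would show whether the stray characters come from byte sequences that the chosen Encode name treats as malformed, and whether a superset encoding decodes the same bytes cleanly.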