perl-i18n

Re: Problems with Perl Asian encodings?

2007-05-14 09:16:14
Samuel L. Bayer wrote:

Has anyone else done such a comparison of GNU recode and Perl Encode? I'd very much prefer to move to Perl, not simply for efficiency but because, unlike GNU recode, it appears to be actively maintained; however, the error rate is just too high, especially considering that the GNU recode output looks clean, and our users have not complained about it.

Hi again all -

Last week, I sent out a query about Asian encodings and Perl Encode vs. GNU recode. Martin Thurn graciously helped me debug this problem, and I can now summarize as follows, quoting Martin:

"  In the sample data you sent, in the original GB2312, right after the
word "diode", there is an octal \244 and octal \112.  Octal \244 =
decimal 164 which is not a legal first-byte in GB2312.
  Recode apparently dropped the \244 and left the \112 as-is, a capital
J.
  Encode apparently converted the \244 to a default UTF-8 "unknown
character" and left the \112 as-is, a capital J."

So the upshot is that GNU recode has a mode which drops these illegal first bytes, whereas Encode by default substitutes the "unknown character" for them. The question, then: is the same dropping behavior possible in Perl Encode? The documentation for some of the FB_ variables is tempting, but pretty opaque.
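To make the question concrete, here is a minimal sketch of two ways the dropping behavior might be obtained, based on what the Encode documentation says about the CHECK argument. It is an illustration under assumptions, not a confirmed answer: the byte string is made up to mirror the quoted diagnosis (it is not the actual sample data), the helper name decode_dropping_bad_octets is hypothetical, and "euc-cn" is assumed to be the right Encode name for the GB2312 data. The documentation describes the code-reference form of CHECK as available from Encode 2.12 onward, so the Encode bundled with Perl 5.8.7 may need upgrading from CPAN for the first variant; the FB_QUIET loop is meant as a fallback that does not rely on that feature.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Made-up input mirroring the quoted diagnosis: "diode", then the stray
# octet \244, then \112 (a capital J).
my $octets = "diode\244\112";

# Default behaviour (no CHECK argument): the undecodable octet is replaced
# by the Unicode replacement character.
my $default = decode('euc-cn', $octets);

# Variant 1: a code reference as CHECK. Encode calls it with the ordinal of
# each undecodable octet and inserts whatever string it returns; returning
# the empty string drops the octet.
my $dropped = decode('euc-cn', $octets, sub { '' });

# Variant 2 (hypothetical helper): FB_QUIET makes decode() return what it
# managed to decode and leave the unprocessed remainder in the source
# variable, so we can discard one octet and carry on.
sub decode_dropping_bad_octets {
    my ($encoding, $bytes) = @_;
    my $out = '';
    while (length $bytes) {
        $out .= decode($encoding, $bytes, Encode::FB_QUIET);
        # Anything still left starts with the octet that could not be decoded.
        substr($bytes, 0, 1, '') if length $bytes;
    }
    return $out;
}

my $looped = decode_dropping_bad_octets('euc-cn', $octets);

# Show the code points of each result as dot-separated hex.
printf "default: %vX\n", $default;
printf "coderef: %vX\n", $dropped;
printf "FB_QUIET loop: %vX\n", $looped;
```

Both variants discard a single octet at a time, so the byte following the bad one (the J here) is decoded on its own, which is what recode appears to do. Whether silently dropping octets is acceptable for the real data is a separate judgment call.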

Again, I'm using Perl 5.8.7, with the version of Encode that comes with that distribution.

Thanks so much in advance -

Sam Bayer
The MITRE Corporation
sam(_at_)mitre(_dot_)org
