Samuel L. Bayer wrote:
Has anyone else done such a comparison of GNU recode and Perl Encode?
I'd very much prefer to move to Perl, not simply for efficiency but
because, unlike GNU recode, it appears to be actively maintained;
however, the error rate is just too high, especially considering that
the GNU recode output looks clean, and our users have not complained
about it.
Hi again all -
Last week, I sent out a query about Asian encodings and Perl Encode vs.
GNU recode. Martin Thurn graciously helped me debug this problem, and I
can now summarize as follows, quoting Martin:
"In the sample data you sent, in the original GB2312, right after the
word "diode", there is an octal \244 and octal \112. Octal \244 =
decimal 164 which is not a legal first-byte in GB2312.
Recode apparently dropped the \244 and left the \112 as-is, a capital
J.
Encode apparently converted the \244 to a default UTF-8 "unknown
character" and left the \112 as-is, a capital J."
So the upshot is that GNU recode has a mode in which it silently drops
these illegal first bytes, whereas Encode replaces them with a
substitution character. The question, then: can Perl Encode be made to
drop them as well? The documentation for the FB_ fallback constants
(FB_DEFAULT, FB_QUIET and friends) looks promising, but it's pretty
opaque.
Again, I'm using Perl 5.8.7, with the version of Encode that ships with
that distribution.
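To make the question concrete, here is a minimal sketch of one possible
workaround, not a confirmed answer: decode with the default fallback,
which substitutes U+FFFD for malformed input, and then delete the
substitution characters afterwards. The sample string is hypothetical
(just "diode" followed by the two bytes Martin described), and it
assumes the data is GB2312 in its EUC-CN form, which Encode calls
'euc-cn'.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the real data: "diode" followed by the
# two bytes from Martin's description (octal \244 and \112).
my $octets = "diode\244\112";

# With no CHECK argument, decode() uses FB_DEFAULT, which replaces
# malformed input with U+FFFD instead of dying.
my $text = decode('euc-cn', $octets);

# Approximate recode's behavior by deleting the substitution characters.
(my $clean = $text) =~ s/\x{FFFD}//g;

# Emit the result as UTF-8 octets.
print encode('UTF-8', $clean), "\n";
```

Note that this removes every U+FFFD in the result, so it hides all
malformed input rather than only this one case. Passing
Encode::FB_CROAK as the third argument to decode() on a copy of the
data would be one way to find out where the bad bytes are before
deciding to discard them.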
Thanks so much in advance -
Sam Bayer
The MITRE Corporation
sam(_at_)mitre(_dot_)org