Re: Problems with Perl Asian encodings?

Samuel L. Bayer wrote:

Has anyone else done such a comparison of GNU recode and Perl Encode? I'd very much prefer to move to Perl, not simply for efficiency but because, unlike GNU recode, it appears to be actively maintained; however, the error rate is just too high, especially considering that the GNU recode output looks clean, and our users have not complained about it.


Hi again all -

Last week, I sent out a query about Asian encodings and Perl Encode vs. GNU recode. Martin Thurn graciously helped me debug this problem, and I can now summarize as follows, quoting Martin:


"  In the sample data you sent, in the original GB2312, right after the
word "diode", there is an octal \244 and octal \112.  Octal \244 =
decimal 164 which is not a legal first-byte in GB2312.
  Recode apparently dropped the \244 and left the \112 as-is, a capital
J.
  Encode apparently converted the \244 to a default UTF-8 "unknown
character" and left the \112 as-is, a capital J."

So the outcome was that there's a mode in GNU recode which will drop these illegal first bytes. So the question is: is the same thing possible in Perl Encode? The documentation for some of the FB_ variables is tempting, but pretty opaque.
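
To make the question concrete, here is a minimal sketch of the behavior I'm after. It assumes the data is the EUC-CN form of GB2312 (Encode name "euc-cn") and relies on the documented FB_QUIET behavior: decode() returns whatever it managed to convert and leaves the unprocessed remainder, starting at the offending byte, in the source variable. The helper name decode_dropping_bad_bytes is my own invention, not part of Encode, and I don't know whether this is the intended way to use the FB_ constants.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Encode): decode EUC-CN octets and
# silently drop any byte that Encode refuses to convert.
sub decode_dropping_bad_bytes {
    my ($bytes) = @_;    # work on a copy; FB_QUIET modifies its source
    my $out = '';
    while (length $bytes) {
        # With FB_QUIET, decode() stops at the first malformed byte,
        # returns the text converted so far, and leaves the rest in $bytes.
        $out .= decode('euc-cn', $bytes, Encode::FB_QUIET);
        # Anything left over begins with the offending byte: drop that one
        # byte and resume decoding from the next one.
        substr($bytes, 0, 1, '') if length $bytes;
    }
    return $out;
}

# The case Martin describes: "diode", then octal \244 (0xA4), then \112 ("J").
my $clean = decode_dropping_bad_bytes("diode\xA4\x4A");
```

On Martin's example I'd expect $clean to come out as "diodeJ", which is what recode produces. The loop always removes at least one byte per pass when anything is left over, so it should terminate even on badly damaged input.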

Again, I'm using Perl 5.8.7, with the versions of Encode that come with that distribution.


Thanks so much in advance -

Sam Bayer
The MITRE Corporation
sam(_at_)mitre(_dot_)org