perl-unicode

Re: Fallback problems with Encode

2002-12-23 16:30:03
Earl Hood <earl(_at_)earlhood(_dot_)com> writes:
Take the following code snippet:

   use Encode q(:all);
   print $Encode::VERSION, "\n";

   my $org = '';
   for my $i (0x20..0xFF){
      $org .= chr($i);
   }
   my $src = $org;
   print "\nASCII -> UTF8\n";
   from_to($src, 'ascii', 'utf8', FB_XMLCREF);
   print $src, "\n";

Prints out the following:

   1.83

   ASCII -> UTF8
    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
   abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87
   \x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97

After some further hacking, I noticed that the success of the
FB_XMLCREF constant is not consistent.  I added the following to the
script above:

   $src = $org;   # $src is already declared above
   print "\nISO-8859-3 -> ISO-8859-8\n";
   from_to($src, 'iso-8859-3', 'iso-8859-8', FB_XMLCREF);
   print $src, "\n";


   ISO-8859-3 -> ISO-8859-8
    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
    abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ
    ‘’“”•–—˜™š›œžŸ &#x126;&#x2d8;£¤\xA5&#x124;§¨

Any insights to this behavior will be appreciated.

from_to is implemented by decoding the 'from' source to Unicode,
and then encoding that Unicode to the 'to' destination.

The FB_XMLCREF fallback is applied on the 'to' side. Your original code
suffers from fallbacks occurring on the 'from' side: 0x80..0xFF are not ASCII.
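
As a rough sketch of those two steps (this is not the actual Encode
source; the helper name two_step is made up for illustration), the
check flag only reaches the encode step:

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_XMLCREF FB_CROAK);

# Hypothetical helper, for illustration only: mimics the two-step shape
# of from_to(). The check flag is passed to the encode ('to') step only.
sub two_step {
    my ($octets, $from, $to, $check) = @_;
    my $chars = decode($from, $octets);    # 'from' side: no check flag passed
    return encode($to, $chars, $check);    # 'to' side: fallback applies here
}

# 0xA1 is U+0126 in iso-8859-3, which iso-8859-8 cannot represent,
# so the 'to' side emits an XML character reference.
my $out = two_step("\xA1", 'iso-8859-3', 'iso-8859-8', FB_XMLCREF);
print "$out\n";    # prints &#x126;

# 0x80 is not ASCII, so a strict decode already fails on the 'from'
# side, before any 'to'-side fallback could be applied.
my $bytes = "\x80";
my $ok = eval { decode('ascii', $bytes, FB_CROAK); 1 };
print $ok ? "decoded\n" : "ascii decode failed\n";
```

What a non-strict ASCII decode does with 0x80..0xFF depends on the
Encode version, so the sketch uses FB_CROAK only to make the 'from'
side failure visible.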

So when you use an 8-bit encoding like iso8859-3 you don't see the problem.

The behaviour is (almost) by design - i.e. it happened that way and 
I decided it made a kind of sense. Using ASCII is considered as 
asking for 7-bitness. If you want one of the 8-bit supersets, use the 
one you want (iso8859-1 aka latin1 most likely, but perhaps one
of the windows ones with smart quotes, m-dash etc.)
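
For instance, naming the 8-bit encoding explicitly on the 'from' side
might look like the following. This is only a sketch, and it assumes
the input really is iso-8859-1:

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Assumption for this sketch: the input octets are iso-8859-1 (latin1).
my $src = "caf\xE9";                    # 0xE9 is e-acute in latin1
from_to($src, 'iso-8859-1', 'utf8');    # converts $src in place
print unpack('H*', $src), "\n";         # prints 636166c3a9
```

Every octet is defined in iso-8859-1, so nothing falls back on the
'from' side and 0xE9 comes out as the two UTF-8 octets C3 A9.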

There is a good case for a "latin-guess" or latin-superset or ... 
which tries to do the right thing.
 
-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/
