perl-unicode

Re: Fallback problems with Encode

2002-12-23 17:30:05
On December 23, 2002 at 22:41, Nick Ing-Simmons wrote:

Prints out the following:

   1.83

   ASCII -> UTF8
    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
   abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87
   \x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97

After some further hacking, I noticed that the FB_XMLCREF constant
does not succeed consistently.  I added the following to the
script above:

   my $src = $org;
   print "\nISO-8859-3 -> ISO-8859-8\n";
   from_to($src, 'iso-8859-3', 'iso-8859-8', FB_XMLCREF);
   print $src, "\n";


   ISO-8859-3 -> ISO-8859-8
    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
   abcdefghijklmnopqrstuvwxyz{|}~\xC2\x80\xC2\x81\xC2\x82\xC2\x83\xC2\x84\xC2\x85\xC2\x86\xC2\x87\xC2\x88\xC2\x89\xC2\x8A\xC2\x8B\xC2\x8C\xC2\x8D\xC2\x8E\xC2\x8F
   \xC2\x90\xC2\x91\xC2\x92\xC2\x93\xC2\x94\xC2\x95\xC2\x96\xC2\x97\xC2\x98\xC2\x99\xC2\x9A\xC2\x9B\xC2\x9C\xC2\x9D\xC2\x9E\xC2\x9F\xC2\xA0&#x126;&#x2d8;贈造\xA5&#x124;則即

Looking more closely, here is some of the output again:

  &#x126;&#x2d8;\x{25c0}\xA5&#x124;Ж
  &#x130;&#x15e;&#x11e;&#x134;\xAD\xAE&#x17b;\xB0&#x127;桶患&#x125;係
  &#x131;&#x15f;&#x11f;&#x135;\xBD\xBE&#x17c;&#xc0;&#xc1;&#xc2;\xC3
  &#xc4;&#x10a;&#x108;&#xc7;&#xc8;&#xc9;&#xca;&#xcb;&#xcc;&#xcd;
  &#xce;&#xcf;\xD0&#xd1;&#xd2;&#xd3;&#xd4;&#x120;&#xd6;\xAA&#x11c;
  &#xd9;&#xda;&#xdb;&#xdc;&#x16c;&#x15c;&#xdf;&#xe0;&#xe1;&#xe2;
  \xE3&#xe4;&#x10b;&#x109;&#xe7;&#xe8;&#xe9;&#xea;&#xeb;&#xec;
  &#xed;&#xee;&#xef;\xF0&#xf1;&#xf2;&#xf3;&#xf4;&#x121;&#xf6;\xBA
  &#x11d;&#xf9;&#xfa;&#xfb;&#xfc;&#x16d;&#x15d;&#x2d9;

Problem characters: \xA5 \xAE \xBE \xC3 \xD0 \xE3 \xF0

from_to is implemented by translating the 'from' source to Unicode,
and then Unicode to the 'to' destination.

This is what I figured since it is impractical to maintain unique
conversion tables between all types of character encodings.
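
To make the two-step model concrete, here is a minimal sketch of my
own (not taken from the Encode source, and not claiming to mirror how
any particular Encode release passes CHECK internally).  The string
"caf\xE9" and the variable names are just illustrations.  With a
source encoding that accepts every octet, the explicit
decode-then-encode and the single from_to() call should agree:

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to FB_XMLCREF);

# Illustrative Latin-1 octets: "caf" followed by byte 0xE9 (e-acute).
my $octets = "caf\xE9";

# Two explicit steps: 'from' octets -> Unicode, then Unicode -> 'to' octets.
my $chars    = decode('iso-8859-1', $octets);
my $two_step = encode('ascii', $chars, FB_XMLCREF);

# The same conversion in one call; from_to() rewrites its first argument.
my $one_step = $octets;
from_to($one_step, 'iso-8859-1', 'ascii', FB_XMLCREF);

print "$two_step\n$one_step\n";
```

Both lines come out as caf&#xe9; because U+00E9 only fails on the
'to' (ascii) side, which is where FB_XMLCREF is applied.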

The FB_XMLCREF happens on the 'to' side. Your original code suffers
from fallbacks occurring on the 'from' side. 0x80..0xFF are not ASCII.

So when you use an 8-bit encoding like iso8859-3 you don't see the problem.
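
A small sketch of what a 'from'-side fallback looks like in isolation,
using only the documented FB_DEFAULT (CHECK omitted) and FB_PERLQQ
modes.  The sample octets are my own.  I have left FB_XMLCREF out of
the code on purpose: the Encode documentation only says the character
reference modes are "about the same" as perlqq, and the \xHH strings
in the output earlier in this thread suggest that on decode they
behave like FB_PERLQQ, but treat that as an inference rather than a
guarantee.

```perl
use strict;
use warnings;
use Encode qw(decode FB_PERLQQ);

# Byte 0xE9 is outside 7-bit ASCII, so decoding it as 'ascii' must fall back.
my $octets = "caf\xE9";

# Default CHECK: the invalid byte becomes U+FFFD (REPLACEMENT CHARACTER).
my $default = decode('ascii', $octets);

# FB_PERLQQ: the invalid byte is spelled out as the literal text \xE9.
# decode() with a CHECK value may modify its argument, so work on a copy.
my $copy   = $octets;
my $perlqq = decode('ascii', $copy, FB_PERLQQ);

printf "default: U+%04X\n", ord(substr($default, -1));
print  "perlqq:  $perlqq\n";
```

Once the byte has been turned into a backslash, an x and two hex
digits, everything left is plain ASCII, so there is nothing for a
'to'-side fallback to act on.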

See above where I highlight the problem characters.  Also, the
iso-2022-jp examples provided in my original post illustrated
the problem.

BTW, in the t/fallbacks.t test case of Encode, 8-bit characters are
used for the ascii test, and entity references are generated for the
8-bit characters.

As I stated in my original post, the problem is that t/fallbacks.t
tests an undocumented (or poorly documented) Encode interface, and
it does not test the well-documented interface.

For example, extending from my code sample in the original post,
if you add the following:

  my $meth = find_encoding('ascii');
  my $src  = $org;
  my $dst  = $meth->encode($src, FB_XMLCREF);
  print $dst, "\n";

The following is generated:

   !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
  abcdefghijklmnopqrstuvwxyz{|}~&#x80;&#x81;&#x82;&#x83;&#x84;&#x85;&#x86;
  &#x87;&#x88;&#x89;&#x8a;&#x8b;&#x8c;&#x8d;&#x8e;&#x8f;&#x90;&#x91;&#x92;
  &#x93;&#x94;&#x95;&#x96;&#x97;&#x98;&#x99;&#x9a;&#x9b;&#x9c;&#x9d;&#x9e;
  &#x9f;&#xa0;&#xa1;&#xa2;&#xa3;&#xa4;&#xa5;&#xa6;&#xa7;&#xa8;&#xa9;&#xaa;
  &#xab;&#xac;&#xad;&#xae;&#xaf;&#xb0;&#xb1;&#xb2;&#xb3;&#xb4;&#xb5;&#xb6;
  &#xb7;&#xb8;&#xb9;&#xba;&#xbb;&#xbc;&#xbd;&#xbe;&#xbf;&#xc0;&#xc1;&#xc2;
  &#xc3;&#xc4;&#xc5;&#xc6;&#xc7;&#xc8;&#xc9;&#xca;&#xcb;&#xcc;&#xcd;&#xce;
  &#xcf;&#xd0;&#xd1;&#xd2;&#xd3;&#xd4;&#xd5;&#xd6;&#xd7;&#xd8;&#xd9;&#xda;
  &#xdb;&#xdc;&#xdd;&#xde;&#xdf;&#xe0;&#xe1;&#xe2;&#xe3;&#xe4;&#xe5;&#xe6;
  &#xe7;&#xe8;&#xe9;&#xea;&#xeb;&#xec;&#xed;&#xee;&#xef;&#xf0;&#xf1;&#xf2;
  &#xf3;&#xf4;&#xf5;&#xf6;&#xf7;&#xf8;&#xf9;&#xfa;&#xfb;&#xfc;&#xfd;&#xfe;
  &#xff;

So why doesn't the from_to() usage generate the same results?

The behaviour is (almost) by design - i.e. it happened that way and 
I decided it made a kind of sense. Using ASCII is considered as 
asking for 7-bitness. If you want one of the 8-bit supersets, use the 
one you want (iso8859-1 aka latin1 most likely, but perhaps one
of the windows ones with smart quotes, em dash etc.)
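
As a rough illustration of why the choice of 8-bit superset matters,
the sketch below decodes the same octet, 0x93, once as iso-8859-1 and
once as cp1252, and then encodes each result to ascii with FB_XMLCREF.
The octet and the %ref hash are my own example, not something from the
original script.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_XMLCREF);

# 0x93 is a C1 control (U+0093) in iso-8859-1 but a left double
# quotation mark (U+201C) in cp1252.
my $octet = "\x93";

my %ref;
for my $from ('iso-8859-1', 'cp1252') {
    my $chars = decode($from, $octet);                   # 'from' side
    $ref{$from} = encode('ascii', $chars, FB_XMLCREF);   # 'to' side
    print "$from: $ref{$from}\n";
}
```

The character reference that FB_XMLCREF emits depends entirely on
which code point the 'from' encoding assigned to the byte.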

IMO, the ASCII case is then wrong.  If you want to be "strict" about
the 7-bitness of ascii, then the "\xHH"s should not show up at all, but
be '?'s, or something else.  Since the output is "\xHH"s, it seems
odd that FB_XMLCREF does not generate "&#xHH;"s instead (see
above).

Maybe I am misunderstanding Encode's conversion operations, so
maybe it is a problem with the documentation not being clear about
this behavior.  But IMHO, what I am getting appears to be incorrect.

--ewh
