perl-unicode

Fallback problems with Encode

2002-12-19 22:30:04
Take the following code snippet:

    use Encode q(:all);
    print $Encode::VERSION, "\n";

    my $org = '';
    for my $i (0x20..0xFF){
        $org .= chr($i);
    }
    my $src = $org;
    print "\nASCII -> UTF8\n";
    from_to($src, 'ascii', 'utf8', FB_XMLCREF);
    print $src, "\n";

Prints out the following:

    1.83

    ASCII -> UTF8
     !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
    abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87
    \x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97
    \x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7
    \xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7
    \xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7
    \xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7
    \xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7
    \xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7
    \xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF

(newlines added for readability).

Now, since FB_XMLCREF was specified, shouldn't the \xHH be &#xHH;?
Or, am I misunderstanding the use of FB_XMLCREF?
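
One way to narrow this down is to run the two halves of the conversion
separately and look at the intermediate string.  This is only a sketch:
it assumes from_to() amounts to a decode() followed by an encode(), which
is an assumption about its internals and may not hold for every Encode
version.  The two-byte input is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_XMLCREF);

# 0x41 is 'A'; 0xE9 is outside US-ASCII, so it cannot be decoded as ascii.
my $octets = "A\xE9";

# Step 1: octets -> Perl characters.  A fallback applied here is acting
# on an input byte that the source encoding cannot decode.
my $in    = $octets;                      # work on a copy
my $chars = decode('ascii', $in, FB_XMLCREF);

# Step 2: Perl characters -> octets.  A fallback applied here is acting
# on a character that the target encoding cannot represent.
my $tmp = $chars;                         # work on a copy
my $out = encode('utf8', $tmp, FB_XMLCREF);

print "after decode: $chars\n";
print "after encode: $out\n";
```

If the \xHH sequence is already present after step 1, the fallback is
being applied to undecodable input bytes rather than to unencodable
characters.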

I examined the t/fallback.t test cases, but none of them test from_to().
In fact, none directly test the functional encode()/decode() routines
either.  What appears to be tested instead is a calling convention,
via find_encoding(), that is not clearly documented.
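
For reference, a minimal sketch of that object-style convention as I
understand it: look up an encoding object with find_encoding(), then
call its encode() method with a CHECK argument.  The sample string is
made up for illustration and is not taken from t/fallback.t.

```perl
use strict;
use warnings;
use Encode qw(find_encoding FB_XMLCREF);

# Look up the encoding object once, then call methods on it.
my $enc = find_encoding('iso-8859-1')
    or die "encoding not found\n";

# U+263A has no mapping in ISO-8859-1, so the fallback has to handle it.
my $chars  = "A\x{263A}";
my $copy   = $chars;                      # work on a copy
my $octets = $enc->encode($copy, FB_XMLCREF);

print $octets, "\n";
```

Here the fallback fires on the encode side, which is the case where an
XML character reference is the expected replacement.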

After some further hacking, I noticed that FB_XMLCREF does not
behave consistently.  I added the following to the
script above:

    $src = $org;    # reset; $src is already declared above
    print "\nISO-8859-3 -> ISO-8859-8\n";
    from_to($src, 'iso-8859-3', 'iso-8859-8', FB_XMLCREF);
    print $src, "\n";

    $src = $org;    # reset again for the next conversion
    print "\nISO-8859-3 -> ISO-2022-JP\n";
    from_to($src, 'iso-8859-3', 'iso-2022-jp', FB_XMLCREF);
    print $src, "\n";

And got the following output (note, depending on your MUA,
non-printable characters may not show and the iso-2022-jp data
may look screwy):

    ISO-8859-3 -> ISO-8859-8
     !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
     abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F
     \x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0&#x126;&#x2d8;\x{25c0}\xA5&#x124;Ж
     &#x130;&#x15e;&#x11e;&#x134;\xAD\xAE&#x17b;\xB0&#x127;桶患&#x125;係
     &#x131;&#x15f;&#x11f;&#x135;\xBD\xBE&#x17c;&#xc0;&#xc1;&#xc2;\xC3
     &#xc4;&#x10a;&#x108;&#xc7;&#xc8;&#xc9;&#xca;&#xcb;&#xcc;&#xcd;
     &#xce;&#xcf;\xD0&#xd1;&#xd2;&#xd3;&#xd4;&#x120;&#xd6;\xAA&#x11c;
     &#xd9;&#xda;&#xdb;&#xdc;&#x16c;&#x15c;&#xdf;&#xe0;&#xe1;&#xe2;
     \xE3&#xe4;&#x10b;&#x109;&#xe7;&#xe8;&#xe9;&#xea;&#xeb;&#xec;
     &#xed;&#xee;&#xef;\xF0&#xf1;&#xf2;&#xf3;&#xf4;&#x121;&#xf6;\xBA
     &#x11d;&#xf9;&#xfa;&#xfb;&#xfc;&#x16d;&#x15d;&#x2d9;

    ISO-8859-3 -> ISO-2022-JP
     !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
     abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D
     \x{008e}\x{008f}\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\x{00a0}
     扤扤£扤\xA5扤§¨扤扤扤扤
     \x{00ad}\xAE扤°扤\x{00b2}\x{00b3}´
     \x{00b5}扤\x{00b7}扤扤扤扤扤\x{00bd}\xBE
     扤扤扤扤\xC3扤扤扤扤扤扤扤扤扤扤扤扤\xD0
     扤扤扤扤扤扤×扤扤扤扤扤扤扤扤扤扤扤\xE3
     扤扤扤扤扤扤扤扤扤扤扤扤\xF0扤扤扤扤扤扤÷
     扤扤扤扤扤扤扤扤

(again, newlines added for readability).
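
Since the raw output depends on how a mail client renders non-printable
bytes, a hex dump of the converted string makes the comparison less
ambiguous.  A minimal sketch: hex_dump() is a hypothetical helper
written for this message, not part of Encode, and the short input range
is chosen only to keep the output small.

```perl
use strict;
use warnings;
use Encode qw(from_to FB_XMLCREF);

# Hypothetical helper: render each octet as two uppercase hex digits.
sub hex_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

# A short slice of the high half of ISO-8859-3 (0xA0 .. 0xA7).
my $src = join '', map { chr($_) } 0xA0 .. 0xA7;

from_to($src, 'iso-8859-3', 'iso-2022-jp', FB_XMLCREF);

# Escape sequences and fallback text both show up as plain hex here.
print hex_dump($src), "\n";
```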

In the ISO-8859-3 -> ISO-8859-8 case, we get a mixture of
XML entity refs and perlqq sequences.  In the ISO-8859-3 -> ISO-2022-JP
case, all we get are perlqq sequences.

Another problem arises when I try to use FB_DEFAULT.  For some
characters I still get perlqq sequences instead of the unknown
character, at least when converting to iso-2022-jp:

    $src = $org;    # reset; $src is already declared above
    print "\nISO-8859-3 -> ISO-2022-JP (default)\n";
    from_to($src, 'iso-8859-3', 'iso-2022-jp', FB_DEFAULT);
    print $src, "\n";

Generates the following:

    ISO-8859-3 -> ISO-2022-JP (default)
     !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
     abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D
     \x{008e}\x{008f}\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\x{00a0}
     扤扤£扤\x{fffd}扤§¨扤扤扤扤
     \x{00ad}\x{fffd}扤°扤\x{00b2}\x{00b3}
     ´\x{00b5}扤\x{00b7}扤扤扤扤扤
     \x{00bd}\x{fffd}扤扤扤扤\x{fffd}
     扤扤扤扤扤扤扤扤扤扤扤扤\x{fffd}扤扤扤扤扤扤
     ×扤扤扤扤扤扤扤扤扤扤扤\x{fffd}
     扤扤扤扤扤扤扤扤扤扤扤扤\x{fffd}扤扤扤扤扤扤
     ÷扤扤扤扤扤扤扤扤

(again, newlines added for readability).

Any insight into this behavior would be appreciated.

Thanks,

--ewh
-- 
Earl Hood, <earl(_at_)earlhood(_dot_)com>
Web: <http://www.earlhood.com/>
PGP Public Key: <http://www.earlhood.com/gpgpubkey.txt>
