Take the following code snippet:
use Encode q(:all);
print $Encode::VERSION, "\n";
my $org = '';
for my $i (0x20..0xFF) {
    $org .= chr($i);
}
my $src = $org;
print "\nASCII -> UTF8\n";
from_to($src, 'ascii', 'utf8', FB_XMLCREF);
print $src, "\n";
It prints out the following:
1.83
ASCII -> UTF8
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87
\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97
\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7
\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7
\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7
\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7
\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7
\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7
\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF
(newlines added for readability).
Now, since FB_XMLCREF was specified, shouldn't the \xHH be &#xHH;?
Or, am I misunderstanding the use of FB_XMLCREF?
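As far as I can tell, from_to() amounts to a decode followed by an encode, so one way to narrow this down is to run the two steps separately with the same check value. The following is only a minimal sketch: the sample strings are my own choice, and which step produces which kind of escape may well depend on the Encode version.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_XMLCREF);

# Decode step: 0x80 and 0x81 are not valid ASCII octets.
# Variables (not literals) are passed because a check value
# may cause the source buffer to be modified.
my $octets  = "~\x80\x81";
my $decoded = decode('ascii', $octets, FB_XMLCREF);

# Encode step: U+0126 (H WITH STROKE) has no mapping in ISO-8859-8.
my $chars   = "A\x{0126}B";
my $encoded = encode('iso-8859-8', $chars, FB_XMLCREF);

print "decode step: $decoded\n";
print "encode step: $encoded\n";
```

If the \xHH sequences show up in the decode step only, then the check value is being interpreted differently in the two directions.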
I examined the t/fallback.t test cases, but none of them test from_to().
Actually, none directly test the functional encode()/decode() routines
either. What appears to be tested instead is a calling convention based
on find_encoding(), which is not clearly documented.
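To make the gap concrete, here is a rough sketch of what a test covering all three calling conventions might look like. It is not taken from t/fallback.t; the sample strings and the expected value are my assumptions about what FB_XMLCREF ought to produce when it is honoured.

```perl
use strict;
use warnings;
use Test::More tests => 3;
use Encode qw(encode find_encoding from_to FB_XMLCREF);

# Assumed expectation: U+0126 becomes an XML character reference.
my $want = 'A&#x126;B';

# 1. Method call on the object returned by find_encoding().
my $chars = "A\x{0126}B";
is(find_encoding('iso-8859-8')->encode($chars, FB_XMLCREF),
    $want, 'find_encoding()->encode()');

# 2. Functional encode().
$chars = "A\x{0126}B";
is(encode('iso-8859-8', $chars, FB_XMLCREF), $want, 'encode()');

# 3. from_to(), which converts the buffer in place.
#    Octet 0xA1 is U+0126 in ISO-8859-3.
my $buf = "A\xA1B";
from_to($buf, 'iso-8859-3', 'iso-8859-8', FB_XMLCREF);
is($buf, $want, 'from_to()');
```

Whether all three assertions pass on a given Encode version is exactly the point in question.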
After some further hacking, I noticed that the FB_XMLCREF constant
does not behave consistently. I added the following to the
script above:
$src = $org;
print "\nISO-8859-3 -> ISO-8859-8\n";
from_to($src, 'iso-8859-3', 'iso-8859-8', FB_XMLCREF);
print $src, "\n";
$src = $org;
print "\nISO-8859-3 -> ISO-2022-JP\n";
from_to($src, 'iso-8859-3', 'iso-2022-jp', FB_XMLCREF);
print $src, "\n";
And got the following output (note, depending on your MUA,
non-printable characters may not show and the iso-2022-jp data
may look screwy):
ISO-8859-3 -> ISO-8859-8
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F
\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0Ħ˘\x{25c0}\xA5ĤЖ
İŞĞĴ\xAD\xAEŻ\xB0ħ桶患ĥ係
ışğĵ\xBD\xBEżÀÁÂ\xC3
ÄĊĈÇÈÉÊËÌÍ
ÎÏ\xD0ÑÒÓÔĠÖ\xAAĜ
ÙÚÛÜŬŜßàáâ
\xE3äċĉçèéêëì
íîï\xF0ñòóôġö\xBA
ĝùúûüŭŝ˙
ISO-8859-3 -> ISO-2022-JP
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D
\x{008e}\x{008f}\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\x{00a0}
扤扤£扤\xA5扤§¨扤扤扤扤
\x{00ad}\xAE扤°扤\x{00b2}\x{00b3}´
\x{00b5}扤\x{00b7}扤扤扤扤扤\x{00bd}\xBE
扤扤扤扤\xC3扤扤扤扤扤扤扤扤扤扤扤扤\xD0
扤扤扤扤扤扤×扤扤扤扤扤扤扤扤扤扤扤\xE3
扤扤扤扤扤扤扤扤扤扤扤扤\xF0扤扤扤扤扤扤÷
扤扤扤扤扤扤扤扤
(again, newlines added for readability).
In the ISO-8859-3 -> ISO-8859-8 case, we get a mixture of
XML entity refs and perlqq sequences. In the ISO-8859-3 -> ISO-2022-JP
case, all we get is perlqq sequences.
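Since raw ISO-2022-JP output is hard to read in a mail client, a reduced case that prints every non-printable octet in hex may be easier to compare. This is just a sketch: visible() is a helper made up for this message, not part of Encode, and the single-character input is my own choice.

```perl
use strict;
use warnings;
use Encode qw(from_to FB_XMLCREF);

# Hypothetical helper: show every octet outside printable ASCII as <HH>.
sub visible {
    my ($octets) = @_;
    $octets =~ s/([^\x20-\x7E])/sprintf '<%02X>', ord $1/ge;
    return $octets;
}

# Octet 0xA1 is U+0126 in ISO-8859-3.
my $buf = "A\xA1B";
from_to($buf, 'iso-8859-3', 'iso-2022-jp', FB_XMLCREF);

# Escape sequences (ESC is 0x1B) and any fallback text become readable.
print visible($buf), "\n";
```

This keeps the ISO-2022-JP escape sequences from being swallowed or misrendered on the way to the reader.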
Another problem shows up when I try to use FB_DEFAULT. For some
characters I still get perlqq sequences instead of the substitution
character, at least when converting to iso-2022-jp:
$src = $org;
print "\nISO-8859-3 -> ISO-2022-JP (default)\n";
from_to($src, 'iso-8859-3', 'iso-2022-jp', FB_DEFAULT);
print $src, "\n";
Generates the following:
ISO-8859-3 -> ISO-2022-JP (default)
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D
\x{008e}\x{008f}\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\x{00a0}
扤扤£扤\x{fffd}扤§¨扤扤扤扤
\x{00ad}\x{fffd}扤°扤\x{00b2}\x{00b3}
´\x{00b5}扤\x{00b7}扤扤扤扤扤
\x{00bd}\x{fffd}扤扤扤扤\x{fffd}
扤扤扤扤扤扤扤扤扤扤扤扤\x{fffd}扤扤扤扤扤扤
×扤扤扤扤扤扤扤扤扤扤扤\x{fffd}
扤扤扤扤扤扤扤扤扤扤扤扤\x{fffd}扤扤扤扤扤扤
÷扤扤扤扤扤扤扤扤
(again, newlines added for readability).
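For comparison, the check values can be run side by side on a single unmappable character. This is a sketch under my own assumptions: plain ASCII serves as a simple reference target next to iso-2022-jp, and U+0126 is just a convenient sample character.

```perl
use strict;
use warnings;
use Encode qw(encode FB_DEFAULT FB_PERLQQ FB_HTMLCREF FB_XMLCREF);

my @modes = (
    [ 'FB_DEFAULT',  FB_DEFAULT  ],
    [ 'FB_PERLQQ',   FB_PERLQQ   ],
    [ 'FB_HTMLCREF', FB_HTMLCREF ],
    [ 'FB_XMLCREF',  FB_XMLCREF  ],
);

my %result;
for my $target ('ascii', 'iso-2022-jp') {
    for my $mode (@modes) {
        my ($name, $check) = @$mode;

        # Fresh copy each time, in case the source gets modified.
        my $chars = "\x{0126}";
        my $out   = encode($target, $chars, $check);
        $result{$target}{$name} = $out;

        # Make escape sequences and high octets readable.
        (my $shown = $out) =~ s/([^\x20-\x7E])/sprintf '<%02X>', ord $1/ge;
        printf "%-12s %-12s %s\n", $target, $name, $shown;
    }
}
```

For the ASCII target I would expect a "?" under FB_DEFAULT and the respective escape notation under the other three; the iso-2022-jp rows are the ones in doubt.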
Any insight into this behavior would be appreciated.
Thanks,
--ewh
--
Earl Hood, <earl(_at_)earlhood(_dot_)com>
Web: <http://www.earlhood.com/>
PGP Public Key: <http://www.earlhood.com/gpgpubkey.txt>