On December 23, 2002 at 22:41, Nick Ing-Simmons wrote:
Prints out the following:
1.83
ASCII -> UTF8
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87
\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97
After some further hacking, I noticed that the behavior of the
FB_XMLCREF constant is not consistent. I added the following to the
script above:
my $src = $org;
print "\nISO-8859-3 -> ISO-8859-8\n";
from_to($src, 'iso-8859-3', 'iso-8859-8', FB_XMLCREF);
print $src, "\n";
ISO-8859-3 -> ISO-8859-8
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~\xC2\x80\xC2\x81\xC2\x82\xC2\x83\xC2\x84\xC2\x85\xC2\x86\xC2\x87\xC2\x88\xC2\x89\xC2\x8A\xC2\x8B\xC2\x8C\xC2\x8D\xC2\x8E\xC2\x8F
\xC2\x90\xC2\x91\xC2\x92\xC2\x93\xC2\x94\xC2\x95\xC2\x96\xC2\x97\xC2\x98\xC2\x99\xC2\x9A\xC2\x9B\xC2\x9C\xC2\x9D\xC2\x9E\xC2\x9F\xC2\xA0Ħ˘贈造\xA5Ĥ則即
Looking more closely, here is some of the output again:
Ħ˘\x{25c0}\xA5ĤЖ
İŞĞĴ\xAD\xAEŻ\xB0ħ桶患ĥ係
ışğĵ\xBD\xBEżÀÁÂ\xC3
ÄĊĈÇÈÉÊËÌÍ
ÎÏ\xD0ÑÒÓÔĠÖ\xAAĜ
ÙÚÛÜŬŜßàáâ
\xE3äċĉçèéêëì
íîï\xF0ñòóôġö\xBA
ĝùúûüŭŝ˙
Problem characters: \xA5 \xAE \xBE \xC3 \xD0 \xE3 \xF0
from_to is implemented by translating the 'from' source to Unicode,
and then Unicode to the 'to' destination.
This is what I figured since it is impractical to maintain unique
conversion tables between all types of character encodings.
The FB_XMLCREF happens on the 'to' side. Your original code suffers
from fallbacks occurring on the 'from' side. 0x80..0xFF are not ASCII.
So when you use an 8-bit encoding like iso8859-3 you don't see the problem.
See above where I highlight the problem characters. The iso-2022-jp
examples provided in my original post also illustrated the problem.
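
To make the two-step description above concrete, here is a minimal
sketch, assuming a reasonably current Encode. It compares from_to()
against an explicit decode() followed by encode(), with FB_XMLCREF
passed only to the encode ('to') step. The byte range used as input is
just an illustrative sample, not the data from the original script.

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to FB_XMLCREF);

# Illustrative input (an assumption): the high half of an 8-bit code page.
my $octets = join '', map { chr } 0xA0 .. 0xFF;

# One step: from_to() converts its first argument in place.
my $one_step = $octets;
from_to($one_step, 'iso-8859-3', 'iso-8859-8', FB_XMLCREF);

# Two steps: octets -> Unicode characters -> octets.
# The fallback is given only to the 'to' side, as described above.
my $unicode  = decode('iso-8859-3', $octets);
my $two_step = encode('iso-8859-8', $unicode, FB_XMLCREF);

print $one_step eq $two_step ? "identical\n" : "different\n";
```

If the two results match, the fallback constant only ever sees what the
'from' decoding produced, which is where the disagreement in this
thread starts.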
BTW, in the t/fallbacks.t test case of Encode, 8-bit characters are
used for the ascii test, and entity references are generated for the
8-bit characters.
As I stated in my original post, the problem is that t/fallbacks.t
tests an undocumented (or poorly documented) Encode interface, and
it does not test the well-documented interface.
For example, extending from my code sample in the original post,
if you add the following:
my $meth = find_encoding('ascii');
my $src = $org;
my $dst = $meth->encode($src, FB_XMLCREF);
print $dst, "\n";
The following is generated:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†
‡ˆ‰Š‹ŒŽ‘’
“”•–—˜™š›œž
Ÿ ¡¢£¤¥¦§¨©ª
«¬­®¯°±²³´µ¶
·¸¹º»¼½¾¿ÀÁÂ
ÃÄÅÆÇÈÉÊËÌÍÎ
ÏÐÑÒÓÔÕÖ×ØÙÚ
ÛÜÝÞßàáâãäåæ
çèéêëìíîïðñò
óôõö÷øùúûüýþ
ÿ
So why doesn't the from_to() usage generate the same results?
The behaviour is (almost) by design - i.e. it happened that way and
I decided it made a kind of sense. Using ASCII is considered as
asking for 7-bitness. If you want one of the 8-bit supersets, use the
one you want (iso8859-1 aka latin1 most likely, but perhaps one
of the windows ones with smart quotes, m-dash etc.)
IMO, the ASCII case is then wrong. If you want to be "strict" about
the 7-bitness of ascii, then the "\xHH"s should not show up at all, but
be '?'s, or something else. Since the output is "\xHH"s, it seems
odd that FB_XMLCREF does not generate "&#HH;"s instead (see
above).
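
For comparison, a small sketch of the encode ('to') side alone, where
the input is already a Perl character string rather than raw octets.
The sample string is hypothetical, and the behavior noted in the
comments is what I would expect from a current Encode, not necessarily
from 1.83.

```perl
use strict;
use warnings;
use Encode qw(encode FB_XMLCREF);

# Hypothetical sample: U+00E9 exists in Latin-1 but not in ASCII;
# U+263A exists in neither.
my $chars = "caf\x{e9} \x{263a}";

# Work on copies: with some CHECK values encode() modifies its input.
my ($for_ascii, $for_latin1) = ($chars, $chars);

my $ascii  = encode('ascii',      $for_ascii,  FB_XMLCREF);
my $latin1 = encode('iso-8859-1', $for_latin1, FB_XMLCREF);

# Print both results with non-ASCII bytes made visible as \xHH.
for my $pair (['ascii', $ascii], ['iso-8859-1', $latin1]) {
    my ($name, $bytes) = @$pair;
    (my $shown = $bytes) =~ s/([^\x20-\x7E])/sprintf '\\x%02X', ord $1/ge;
    print "$name: $shown\n";
}
```

Here the characters that the target cannot represent come out as
character references, while the 'ascii' target refuses the 8-bit
character that 'iso-8859-1' accepts as a plain byte.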
Maybe I am misunderstanding Encode's conversion operations, so
maybe it is a problem with the documentation not being clear about
this behavior. But IMHO, what I am getting appears to be incorrect.
--ewh