Earl Hood <earl(_at_)earlhood(_dot_)com> writes:
ISO-8859-3 -> ISO-8859-8
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~
Ħ˘£¤\xA5Ĥ§¨
Look more closely, I will include some of the output again:
Ħ˘£¤\xA5Ĥ§¨
İŞĞĴ\xAEŻ°ħ²³´µĥ·¸
ışğĵ½\xBEżÀÁÂ\xC3
ÄĊĈÇÈÉÊËÌÍ
ÎÏ\xD0ÑÒÓÔĠÖªĜ
ÙÚÛÜŬŜßàáâ
\xE3äċĉçèéêëì
íîï\xF0ñòóôġöº
ĝùúûüŭŝ˙
Problem characters: \xA5 \xAE \xBE \xC3 \xD0 \xE3 \xF0
The FB_XMLCREF happens on the 'to' side. Your original code suffers
from fallbacks occurring on the 'from' side. 0x80..0xFF are not ASCII.
So when you use an 8-bit encoding like iso8859-3 you don't see the problem.
See above where I highlight the problem characters.
So I was too glib. You see your "problem" when the octet is not defined
in the source character set. e.g. 0xA5 is not given a meaning by iso-8859-3.
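A minimal sketch of how one might list such octets, assuming an Encode where FB_CROAK makes decode die on an octet with no mapping. The variable names are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode :fallbacks);

# Collect the octets in 0xA0..0xFF that iso-8859-3 gives no meaning to.
my @undefined;
for my $code (0xA0 .. 0xFF) {
    my $octet = pack('C', $code);    # fresh copy: decode may modify its argument
    my $ok = eval { decode('iso-8859-3', $octet, FB_CROAK); 1 };
    push @undefined, sprintf('\x%02X', $code) unless $ok;
}
print "Undefined in iso-8859-3: @undefined\n";
```

If the tables match the standard, the list should line up with the "problem characters" highlighted above.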
Also, the iso-2022-jp examples provided in my original post
illustrated the problem.
Possibly. iso-2022-jp is an escape encoding and has a whole slew of other
things to worry about.
BTW, in the t/fallbacks.t test case of Encode, 8-bit characters are
used for the ascii test, and entity references are generated for the
8-bit characters.
As I stated in my original post, the problem is that t/fallbacks.t
tests an undocumented (or poorly documented) Encode interface, and
it does not test the well-documented interface.
Whether un(der)?documented or not, the object style used in t/fallbacks.t
is the way the internals work.
You say "... it is impractical to maintain unique
conversion tables between all types of character encodings." - it is even
more impractical to _test_ them that way.
...
For example, extending from my code sample in the original post,
if you add the following:
my $meth = find_encoding('ascii');
my $src = $org;
my $dst = $meth->encode($src, FB_XMLCREF);
print $dst, "\n";
The following is generated:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
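abcdefghijklmnopqrstuvwxyz{|}~&#x80;&#x81;&#x82;&#x83;&#x84;&#x85;&#x86;
&#x87;&#x88;&#x89;&#x8a;&#x8b;&#x8c;&#x8d;&#x8e;&#x8f;&#x90;&#x91;&#x92;
&#x93;&#x94;&#x95;&#x96;&#x97;&#x98;&#x99;&#x9a;&#x9b;&#x9c;&#x9d;&#x9e;
&#x9f;&#xa0;&#xa1;&#xa2;&#xa3;&#xa4;&#xa5;&#xa6;&#xa7;&#xa8;&#xa9;&#xaa;
&#xab;&#xac;&#xad;&#xae;&#xaf;&#xb0;&#xb1;&#xb2;&#xb3;&#xb4;&#xb5;&#xb6;
&#xb7;&#xb8;&#xb9;&#xba;&#xbb;&#xbc;&#xbd;&#xbe;&#xbf;&#xc0;&#xc1;&#xc2;
&#xc3;&#xc4;&#xc5;&#xc6;&#xc7;&#xc8;&#xc9;&#xca;&#xcb;&#xcc;&#xcd;&#xce;
&#xcf;&#xd0;&#xd1;&#xd2;&#xd3;&#xd4;&#xd5;&#xd6;&#xd7;&#xd8;&#xd9;&#xda;
&#xdb;&#xdc;&#xdd;&#xde;&#xdf;&#xe0;&#xe1;&#xe2;&#xe3;&#xe4;&#xe5;&#xe6;
&#xe7;&#xe8;&#xe9;&#xea;&#xeb;&#xec;&#xed;&#xee;&#xef;&#xf0;&#xf1;&#xf2;
&#xf3;&#xf4;&#xf5;&#xf6;&#xf7;&#xf8;&#xf9;&#xfa;&#xfb;&#xfc;&#xfd;&#xfe;
&#xff;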
So why doesn't the from_to() usage generate the same results?
Because the ->decode side has removed the non-representable octets
and replaced each of them with four characters: \xHH.
So there are no hi-bit chars left to cause entity refs.
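To make the two sides visible, here is a small sketch that does the decode and the encode by hand. It assumes FB_PERLQQ on the decode side to get the \xHH replacement described above. The two-octet input is just an illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode :fallbacks);

# 0xA1 is defined in iso-8859-3 (U+0126); 0xA5 is not defined.
my $octets = "\xA1\xA5";

# Step 1, the 'from' side: the undefined octet becomes the four ASCII
# characters \xA5, the defined one becomes a real Unicode character.
my $string = decode('iso-8859-3', $octets, FB_PERLQQ);

# Step 2, the 'to' side: only characters that ascii cannot represent
# get a character reference; the literal \xA5 is already plain ASCII.
my $result = encode('ascii', $string, FB_XMLCREF);

print $result, "\n";
```

The character reference should appear only for the defined character, while the \xA5 text passes through the encode step untouched.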
IMO, the ASCII case is then wrong. If you want to be "strict" about
the 7-bitness of ascii, then the "\xHH"s should not show up at all, but
be '?'s, or something else.
You can get that (I believe) by passing appropriate fallback options to
->decode of ASCII. I personally dislike fallback to '?' as it loses
information in a way that is hard to back-track - which is why default
fallback is \xHH.
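A sketch of the two choices, assuming current Encode behaviour where a check of FB_DEFAULT substitutes U+FFFD on decode and the encoding's substitution character on encode. The sample string is made up:

```perl
use strict;
use warnings;
use Encode qw(decode encode :fallbacks);

my $octets = "ab\xA5";

# FB_PERLQQ keeps the octet value visible as the four characters \xA5.
my $copy   = $octets;               # decode may modify its argument
my $perlqq = decode('ascii', $copy, FB_PERLQQ);

# FB_DEFAULT substitutes U+FFFD on decode; encoding that back to ascii
# with the default check yields the encoding's substitution character.
my $subst = decode('ascii', $octets, FB_DEFAULT);
my $lossy = encode('ascii', $subst);

print "perlqq: $perlqq\n";
print "lossy:  $lossy\n";
```

The lossy form no longer says which octet was there, which is the back-tracking problem mentioned above.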
Since the output is "\xHH"s, it seems
odd that FB_XMLCREF does not generate "&#HH;"s instead (see
above).
XMLCREFs are Unicode. ASCII 0xA0 (e.g.) is NOT Unicode; it is undefined.
Maybe I am misunderstanding Encode's conversion operations, so
maybe it is a problem with the documentation not being clear about
this behavior. But IMHO, what I am getting appears to be incorrect.
And IMHO you are getting what I "designed" it to produce ;-)
I strongly recommend doing conversions in two steps explicitly - that way
you can get whatever you want.
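As one possible shape for that, here is a hypothetical helper (convert_strict is not part of Encode) that refuses undefined source octets but still allows character references on the output side. Other policies are just a matter of swapping the two check values:

```perl
use strict;
use warnings;
use Encode qw(decode encode :fallbacks);

# Hypothetical helper: refuse undefined source octets, but allow
# character references for anything the target cannot represent.
sub convert_strict {
    my ($octets, $from, $to) = @_;
    my $string = decode($from, $octets, FB_CROAK);   # dies on bad input
    return encode($to, $string, FB_XMLCREF);
}

my $good = convert_strict("\xA1", 'iso-8859-3', 'ascii');
my $bad  = eval { convert_strict("\xA5", 'iso-8859-3', 'ascii') };

print "good: $good\n";
print "bad input rejected: $@" unless defined $bad;
```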
I am also willing to concede that documentation could be improved :-)
--ewh
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/