Anton Shcherbinin <useperl(_at_)fastmail(_dot_)fm> writes:
Encode::from_to($string, SOURCE, TARGET) changes all characters which
are missing in TARGET into '?' chars (ok, to be exact <subchar>s). This
is probably the most reasonable *default* behavior. But I could give
a couple of arguments why other behavior (not to change those chars
missing in target encoding) is also reasonable and sometimes much more
reasonable.
My native language, Russian, suffers from having FIVE one-byte encodings
(windows-1251, koi8-r, iso-8859-5, cp866, "MacCyrillic"), all of which
are in use, some more often than others. Conversions from one encoding
to another are very frequent, and sometimes we just have to make the
reverse conversion.
We get the same "problem" in English with Windows "smart quotes"
and other MSWord-isms being sent out as supposedly iso-8859-1 when
they really meant windows-1252 or whatever.
The problem with just retaining the original is that, unless one encoding
is a strict superset of the other and code points are the same for
the same characters, the meaning may be corrupted. You are in general
better off leaving it in the superset encoding or UTF-8! If you don't
like the '?', there are fallback schemes to put \x{uuuu} or HTML escapes,
which at least give the reader a hint as to what was there.
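
A minimal sketch of those fallback schemes, assuming a reasonably recent
Encode where the FB_PERLQQ and FB_HTMLCREF check values are available.
The sample string and variable names are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+0416 (CYRILLIC CAPITAL LETTER ZHE) has no mapping in iso-8859-1.
my $text = "A\x{0416}B";

# Default: the unmappable character becomes the substitution character.
my $subst = encode('iso-8859-1', $text);

# Perl-style escape of the code point instead of '?'.
my $perlqq = encode('iso-8859-1', $text, Encode::FB_PERLQQ);

# HTML numeric character reference instead of '?'.
my $html = encode('iso-8859-1', $text, Encode::FB_HTMLCREF);

print "$subst\n$perlqq\n$html\n";
```

Both fallback values include LEAVE_SRC, so $text itself is not modified by
these calls.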
MY QUESTION IS: how can I convert text from one one-byte encoding to
another while leaving characters that are missing in the target encoding
unchanged, instead of turning them into '?'?
There is no built-in way to do it directly. And from_to is particularly
problematic as the options arg is applied to both the decode and the
re-encode steps, whereas you only want to special-case the re-encode.
You can do it via the internal form, something like this:
use Encode qw(find_encoding);

sub sloppy_from_to
{
    my ($src, $SOURCE, $TARGET) = @_;
    my $from = find_encoding($SOURCE) or die "Unknown encoding '$SOURCE'";
    my $to   = find_encoding($TARGET) or die "Unknown encoding '$TARGET'";
    my $dest = '';
    # Assume all of $src is representable in internal form
    my $uni = $from->decode($src);
    while (length($uni))
    {
        # FB_QUIET stops at the first unmappable character and leaves
        # the unconverted remainder in $uni
        $dest .= $to->encode($uni, Encode::FB_QUIET);
        if (length($uni))
        {
            # Not all converted...
            # some ad hoc scheme to "copy" the non-representable char,
            # e.g. chop off 1 char, re-encode it in SOURCE and append that
            $dest .= $from->encode(substr($uni, 0, 1, ''));
        }
    }
    return $dest;
}
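
As a rough check of the idea, here is a self-contained sketch that repeats
the sub in compact form (so it runs on its own) and feeds it a koi8-r
string containing a box-drawing character, which windows-1251 lacks. It
assumes Encode::FB_QUIET as the check value; the sample bytes are chosen
only for illustration:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

sub sloppy_from_to {
    my ($src, $source, $target) = @_;
    my $from = find_encoding($source) or die "Unknown encoding '$source'";
    my $to   = find_encoding($target) or die "Unknown encoding '$target'";
    my $dest = '';
    my $uni  = $from->decode($src);
    while (length $uni) {
        # Encode as far as possible; the unconverted rest stays in $uni.
        $dest .= $to->encode($uni, Encode::FB_QUIET);
        # Copy one unmappable character through in its source-encoding bytes.
        $dest .= $from->encode(substr($uni, 0, 1, '')) if length $uni;
    }
    return $dest;
}

# koi8-r: 0xC1 is CYRILLIC SMALL LETTER A, 0x80 is BOX DRAWINGS LIGHT
# HORIZONTAL (U+2500), which has no windows-1251 equivalent.
my $in  = "\xC1\x80\xC1";
my $out = sloppy_from_to($in, 'koi8-r', 'cp1251');

printf "%vX\n", $out;    # the letters become 0xE0, the 0x80 byte is kept
```

Note that the surviving 0x80 byte means something else when read as
windows-1251, which is exactly the corruption risk described earlier.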
I did try to find it out myself. At some point I thought that
from_to($string, SRCenc, TGTenc, ENCODE_LEAVE_SRC)
is just what I wanted, because it LEAVEs those chars in SRC that
ENCODE_NOREP... but unfortunately no, it leaves all source string
untouched unconditionally.
Thanks in advance for any clues.
If my English and/or my question is far from clear, please tell me and
I'll do my best to rewrite it in other words.
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/