perl-unicode

Re: Can from_to($s, SRC, TGT) leave chars missing in TGT unchanged?

2002-10-11 02:30:04
Anton Shcherbinin <useperl(_at_)fastmail(_dot_)fm> writes:
Encode::from_to($string, SOURCE, TARGET) changes all characters which
are missing in TARGET into '?' chars (ok, to be exact, <subchar>s). This
is probably the most reasonable *default* behavior. But I could give
a couple of arguments why other behavior (not changing those chars
missing in the target encoding) is also reasonable, and sometimes much
more reasonable.

My native language, Russian, suffers from having FIVE one-byte encodings
(windows-1251, koi8-r, iso-8859-5, cp866, "MacCyrillic"), all of which
are in use everywhere, some more often than others. Conversions from one
encoding to another are very frequent, and sometimes we just have to make
the reverse conversion.

We get the same "problem" in English with Windows "smart quotes"
and other MSWord-isms being sent out as supposedly iso-8859-1 when
they really meant windows-1252 or whatever.

The problem with just retaining the original is that, unless one encoding
is a strict superset of the other and the code points are the same for
the same characters, the meaning may be corrupted. You are in general
better off "leaving" it in the superset encoding or in UTF-8! If you don't
like the '?', there are fallback schemes to put \x{HHHH} or HTML escapes,
which at least give the reader a hint as to what was there.
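
To make those fallback schemes concrete, here is a minimal sketch. It assumes an Encode release that provides the FB_PERLQQ, FB_HTMLCREF and FB_XMLCREF check values; the sample string and the to_latin1 helper are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Three Cyrillic letters that iso-8859-1 cannot represent, plus ASCII text.
my $text = "\x{43f}\x{440}\x{438} ok";

# Hypothetical helper: encode a fresh copy each time, since some CHECK
# values modify their argument in place.
sub to_latin1 {
    my ($check) = @_;
    my $copy = $text;
    return encode('iso-8859-1', $copy, $check);
}

my $subchar = to_latin1(Encode::FB_DEFAULT);   # unmappable chars become '?'
my $perlqq  = to_latin1(Encode::FB_PERLQQ);    # Perl-style \x{HHHH} escapes
my $htmlref = to_latin1(Encode::FB_HTMLCREF);  # decimal references, e.g. &#1087;
my $xmlref  = to_latin1(Encode::FB_XMLCREF);   # hexadecimal references, e.g. &#x43f;

print "$_\n" for $subchar, $perlqq, $htmlref, $xmlref;
```

Every one of these is lossy for round-tripping back to the source encoding, but the escaped forms at least keep the code point visible.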


MY QUESTION IS: how can I convert text from one one-byte encoding to
another, leaving characters missing in the target encoding unchanged
instead of changing them into '?'?

There is no built-in way to do it directly. And from_to is particularly
problematic, as the options arg is applied to both the decode and the
re-encode steps, whereas you only want to special-case the re-encode.
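
Doing the two steps by hand lets each one have its own CHECK value. A rough sketch, assuming the cp1251 and koi8-r tables are installed and using a made-up input string:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative input: eight windows-1251 octets, six Cyrillic letters
# wrapped in guillemets. The guillemets (0xAB, 0xBB) have no koi8-r
# equivalent; the letters do.
my $octets = "\xab\xcf\xf0\xe8\xe2\xe5\xf2\xbb";

# Step 1: decode strictly. Malformed input dies here instead of being
# papered over; LEAVE_SRC keeps $octets from being modified in place.
my $chars = decode('cp1251', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);

# Step 2: encode leniently. Only this step gets the fallback, so
# characters missing from koi8-r come out as \x{HHHH} escapes.
my $koi8 = encode('koi8-r', $chars, Encode::FB_PERLQQ);

print "$koi8\n";
```

The choice of FB_CROAK for the decode and FB_PERLQQ for the encode is only an example; the point is that the two calls no longer have to share one setting.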

You can do it via the internal form, something like this:

use Encode qw(find_encoding);

sub sloppy_from_to
{
     my ($src,$SOURCE,$TARGET) = @_;
     my $from = find_encoding($SOURCE) or die "Unknown encoding '$SOURCE'";
     my $to   = find_encoding($TARGET) or die "Unknown encoding '$TARGET'";
     my $dest = '';
     # Assume all of $src is representable in internal form
     my $uni = $from->decode($src);
     while (length($uni))
      {
       # FB_QUIET (i.e. RETURN_ON_ERR) stops at the first char $TARGET lacks,
       # returns the octets encoded so far and leaves the rest in $uni
       $dest .= $to->encode($uni,Encode::FB_QUIET);
       if (length($uni))
        {
         # Not all converted...
         # some ad hoc scheme to "copy" the non-representable char,
         # e.g. chop off 1 char, re-encode it in $SOURCE and append that
         $dest .= $from->encode(substr($uni,0,1,''));
        }
      }
     return $dest;
}
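
For comparison, later Encode releases document that CHECK may be a code reference, called with the ordinal of each unmappable character and returning the octets to emit (2.12 onward, according to the Encode documentation). Assuming such a release, the same idea might be sketched without the explicit loop. The name passthrough_from_to and the sample input are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper with the same goal as sloppy_from_to.
sub passthrough_from_to {
    my ($octets, $source, $target) = @_;
    my $from = find_encoding($source) or die "Unknown encoding '$source'";
    my $to   = find_encoding($target) or die "Unknown encoding '$target'";

    my $chars = $from->decode($octets);

    # For each character the target lacks, emit that character's octets
    # in the SOURCE encoding, i.e. copy the original byte through.
    return $to->encode($chars, sub { $from->encode(chr $_[0]) });
}

# Six Cyrillic letters wrapped in guillemets, as windows-1251 octets.
my $in  = "\xab\xcf\xf0\xe8\xe2\xe5\xf2\xbb";
my $out = passthrough_from_to($in, 'cp1251', 'koi8-r');

printf "%vX\n", $out;   # dump the resulting octets in hex
```

The caveat above still applies: the bytes copied through mean something else in the target encoding, so the result is only safe if the reader knows which positions were left alone.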


I did try to find it out myself. At some point I thought that
from_to($string, SRCenc, TGTenc, ENCODE_LEAVE_SRC)
is just what I wanted, because it LEAVEs those chars in SRC that
ENCODE_NOREP... but unfortunately no, it leaves the whole source string
untouched unconditionally.

Thanks in advance for any clues.

If my English and/or my question is far from clear, please tell me and
I'll do my best to rewrite it in other words.
-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/
